The Linguabase is a comprehensive vocabulary database that powers our word games. It provides word lists with difficulty rankings, rich definitions, and semantic associations—everything needed to build word puzzles that go beyond simple letter arrangement. When a player sees "hiking," they might emphasize nature (scenery, trails) or exercise (fitness, exertion)—same word, different contextual flavors. A good word cloud represents the whole word, not just one angle.
Building this kind of database by hand would be absurdly expensive. The first 10 associations for any word are easy (apple → fruit, red, tree). The next 10 are harder. The next 10 are harder still. This nonlinear difficulty means a skilled lexicographer using modern corpus tools might spend an hour per word to build out 50 associations. Multiply that by 400,000 words and you get 200 person-years of work. At $50/hour, that's over $20 million in labor alone, before the overhead of coordinating a team and checking quality. This is why nobody has done it.
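The back-of-envelope arithmetic checks out directly (the $50/hour rate is from the paragraph above; the 2,000-hour work year is a standard assumption):

```python
# Cost of hand-building 50 associations per word, per the estimate above.
# Assumptions: 1 hour per word, 2,000 working hours per person-year, $50/hour.
words = 400_000
hours_per_word = 1
hours_per_person_year = 2_000
rate_per_hour = 50

total_hours = words * hours_per_word                  # 400,000 hours
person_years = total_hours / hours_per_person_year    # 200 person-years
total_cost = total_hours * rate_per_hour              # $20,000,000

print(person_years, total_cost)  # 200.0 20000000
```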
The Linguabase is the first database purpose-built for word puzzle development, and the most comprehensive. It offers six data layers: curated word lists with difficulty scores, readable definitions, content safety filters, semantic associations (~50 related terms per word), morphological word families, and 1.46 million usage examples. We've been building it since 2010, developing a process that produces data that works for gameplay—not just dictionary definitions.
Most words can be associated in multiple directions. When a word carries several distinct meanings, the Linguabase expands each sense separately, then interleaves them:
"Space" could go toward astronomy (cosmos, galaxy, astronaut), physical room (parking, storage, breathing room), typography (kerning, leading, margin), or abstract expanse (void, infinity, emptiness). A player might think of any of these—so the database needs to cover all the directions, not just the obvious one.
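The expand-then-interleave step can be sketched as a round-robin merge over per-sense lists. The sense lists below are illustrative, not actual Linguabase data:

```python
from itertools import zip_longest

def interleave_senses(senses):
    """Round-robin merge per-sense association lists so every
    meaning is represented near the top of the word cloud."""
    merged = []
    for tier in zip_longest(*senses.values()):
        merged.extend(word for word in tier if word is not None)
    return merged

# Illustrative sense lists for "space" (from the example above).
space = {
    "astronomy":  ["cosmos", "galaxy", "astronaut"],
    "room":       ["parking", "storage", "breathing room"],
    "typography": ["kerning", "leading", "margin"],
}
print(interleave_senses(space))
# ['cosmos', 'parking', 'kerning', 'galaxy', 'storage', 'leading', ...]
```

Because the merge cycles through senses, a player who only knows one meaning of the word still finds a match near the top of the list.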
Why can't you just ask ChatGPT?
You might think you could just ask ChatGPT. But LLMs give you what people typically want—ask for pets, you get cats, dogs, and hamsters. That's by design, and usually it's right. But puzzles need variety across meanings, not the most likely answer. The good news for puzzlemaking is that LLMs are much better at judging variety than producing it. Ask one to generate associations for "coffee" and the lists come back valid but banal: "espresso, caffeine, brew, aroma"—all one sense of the word. Where's morning? Tea? Bitter? Wake up?
What is a "good" word association?
Thesauruses give you synonyms, but synonyms aren't what we need. "Happy" has synonyms (joyful, glad). But "apple" doesn't—there's no other word that means apple. For nouns especially, we need free association: what comes to mind when you think "apple"? That's different from synonyms. And it raises a hard question: how do you even define "related"?
There's no correct answer. Rhinoceros, savanna, and mammoth are all related to elephant—but in different ways (similar animal, habitat, evolutionary cousin). Even without a single right answer, though, you can tell a good word cloud from a bad one. The elephant cloud should probably include all three directions.
How We Built It
The Linguabase combines two ingredients: human-written associations (especially for common words) and LLM-assisted curation from 130 million API calls to OpenAI's ChatGPT and Anthropic's Claude.
We started in 2010 by pulling from existing resources: Roget's Thesaurus, Princeton's WordNet (117,000 synonym sets), Wiktionary, and specialized databases like the NASA Thesaurus. This gave us good coverage for common words like "earth" or "courage." But huge gaps remained in the vocabulary.
In 2013, we tried to fill those gaps with automation. We received a grant of 2.3 million supercomputer hours from NSF's XSEDE program and ran statistical topic modeling (latent Dirichlet allocation) over large text corpora, analyzing which words appear together—the same family of co-occurrence techniques behind "movies you might like" recommendations. This produced useful candidates, but also a lot of noise: misspellings from source texts, compound words split incorrectly, uppercase/lowercase confusion. Statistical methods have inherent limits.
We shipped our first game with a limited vocabulary—it was all we could build with pre-LLM methods. When LLMs arrived, we could finally scale up. That enabled our second game, which gives players more freedom and draws from a much larger word space.
Give an LLM a list and ask it to judge, though, and it works. It can compare items directly and recognize which associations span different senses.
So we use LLMs as editors, not generators. We gather a large pool of candidate words from various sources, give the pool to an LLM along with information about the target word's different meanings, and ask it to sort and rank. This works well. Example: give an LLM the words "walk, stomp, stride, February, April, month" and ask it to sort them into "march" (walking) vs. "March" (the month)—it does this perfectly.
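A minimal sketch of that sort-and-rank task, using the march/March example. The prompt wording and the response parser are illustrative; the actual API call is omitted:

```python
def build_sort_prompt(target, senses, candidates):
    """Assemble an 'LLM as editor' prompt: given candidate words,
    ask the model to assign each one to a sense of the target word."""
    sense_lines = "\n".join(f"- {s}" for s in senses)
    return (
        f"Target word: {target}\n"
        f"Senses:\n{sense_lines}\n"
        f"Candidates: {', '.join(candidates)}\n"
        "Assign each candidate to exactly one sense, "
        "one per line as 'candidate -> sense'."
    )

def parse_response(text):
    """Parse 'candidate -> sense' lines from the model's reply."""
    pairs = {}
    for line in text.strip().splitlines():
        candidate, _, sense = line.partition("->")
        pairs[candidate.strip()] = sense.strip()
    return pairs

prompt = build_sort_prompt(
    "march",
    ["march (walking)", "March (the month)"],
    ["walk", "stomp", "stride", "February", "April", "month"],
)
# A well-behaved model replies along these lines (abridged):
reply = "walk -> march (walking)\nFebruary -> March (the month)"
print(parse_response(reply))
```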
The process has three phases: expand (gather lots of candidate words), audit (have LLMs evaluate them), contract (keep only the good ones). We repeat this. The ranking uses LLM judgments plus other signals:
Generating candidates
- Professional curation: Our lexicographer Orin Hargraves and graduate students hand-built 5,000 curated lists (like "types of sushi" or "architectural styles")
- Reference works: WordNet, Wiktionary, NASA Thesaurus, Getty Art & Architecture Thesaurus, medical terminology databases, 70+ specialized sources
- Library of Congress subject headings: Since 1897, LOC catalogers have organized 17 million books into 648,460 subject categories. Each category becomes a seed for LLM expansion—"18th-century colonial architecture" yields domain-specific words that pure generation misses
- Cross-referencing: If word A lists word B as related, we check whether B should list A
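The cross-referencing step in the last bullet can be sketched as a symmetry audit over the association graph (toy data, not Linguabase entries):

```python
def missing_backlinks(assoc):
    """Find pairs where A lists B but B does not list A back,
    so a reviewer (human or LLM) can decide whether the link
    should be mutual."""
    gaps = []
    for a, related in assoc.items():
        for b in related:
            if a not in assoc.get(b, []):
                gaps.append((b, a))  # candidate: add a to b's list
    return gaps

assoc = {
    "apple": ["fruit", "tree"],
    "fruit": ["apple"],
    "tree":  [],            # "tree" doesn't list "apple" back
}
print(missing_backlinks(assoc))  # [('tree', 'apple')]
```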
The Library of Congress approach shows our philosophy. We're not asking LLMs to create from nothing. We're leveraging chains of human work: authors wrote books, LOC catalogers organized those books into subjects, we use those subjects as anchors for LLM expansion. The chain: Authors → Books → LOC Catalogers → Subject Headings → LLM Expansion → Word Lists. The LLM's job is to fill in a neighborhood that humans already defined.
Some associations require human curation. LLMs have cultural biases—ask for "breakfast foods" and you get the American breakfast: bacon, eggs, pancakes. You don't get shakshuka (the Middle East), congee (China), dosa (India), or huevos rancheros (Mexico). These require human knowledge to add.
Some words are actually different words depending on capitalization. "march" (walking) vs. "March" (the month). "swift" (fast) vs. "Swift" (programming language) vs. "SWIFT" (international banking network). We've identified 3,509 English words with two capitalization variants and 86 words with three—cases like swat/Swat/SWAT (to hit, a Pakistani valley, a police unit). LLMs can sort these correctly if you prompt them carefully—the challenge is knowing which words need checking in the first place.
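Finding which words need checking can be sketched as a scan for vocabulary entries that differ only in case (the word list here is illustrative):

```python
from collections import defaultdict

def case_variant_groups(vocabulary):
    """Group entries that differ only by capitalization
    (e.g. swat / Swat / SWAT) so each group can be routed
    to an LLM for sense separation."""
    groups = defaultdict(set)
    for word in vocabulary:
        groups[word.lower()].add(word)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}

vocab = ["march", "March", "swift", "Swift", "SWIFT", "apple"]
print(case_variant_groups(vocab))
# {'march': ['March', 'march'], 'swift': ['SWIFT', 'Swift', 'swift']}
```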
Some words connect to everything. "Mammal" could appear in the word cloud for dog, cat, whale, elephant—every mammal. If we didn't penalize these superconnectors, they'd crowd out more interesting associations. So we downweight words that have too many inbound links, making room for distinctive connections.
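One simple form of that downweighting: scale a candidate's score by a penalty that grows with its inbound link count. The logarithmic form and the pivot of 50 inbound links are illustrative choices, not the production formula:

```python
import math

def downweight(score, inbound_links, pivot=50):
    """Penalize superconnectors: the more word clouds a term
    appears in, the more its score is reduced. Log form and
    pivot value are illustrative assumptions."""
    penalty = 1.0 / (1.0 + math.log1p(inbound_links / pivot))
    return score * penalty

# "mammal" appears in thousands of clouds; a distinctive term
# like "tusk" appears in only a few.
print(downweight(0.9, 5000))  # heavily penalized
print(downweight(0.9, 10))    # barely penalized
```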
Ranking and mixing
- LLM validation: 130 million API calls, focused on judging (not generating)—checking if relationships are real, separating different word senses, flagging bad connections
- Morphology filtering: Removing duplicates like run/runs/running
- Word uniqueness: Penalizing overrepresented words (the superconnector problem above)
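The morphology filter can be sketched with naive suffix stripping. A production system would use a real lemmatizer; this minimal version is just enough to collapse the run/runs/running example:

```python
def naive_lemma(word):
    """Crude suffix stripping to collapse inflected forms.
    Illustrative only; real pipelines use a proper lemmatizer."""
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def dedupe_morphology(words):
    """Keep only the first word seen per lemma."""
    seen, kept = set(), []
    for w in words:
        lemma = naive_lemma(w)
        if lemma not in seen:
            seen.add(lemma)
            kept.append(w)
    return kept

print(dedupe_morphology(["run", "runs", "running", "trail"]))
# ['run', 'trail']
```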
Balanced word clouds
For any word, we want the associations to feel right and play right, and cover all the angles a player might have in mind. The challenge is that words carry multiple meanings for different reasons.
Homographs are unrelated words that happen to share spelling—"pupil" the student has nothing to do with "pupil" the eye part. Polysemy is one word whose meaning branched over time—"mouth" started as the body part, then extended to river mouths and cave mouths, all variations on "opening." Facets are the most common case in English: different aspects of the same meaning—"elephant" is one thing, but you might think of its anatomy, its habitat, its cultural symbolism, or its behavioral traits. Our algorithm expands each sense separately, then interleaves them, maximizing the chance players find what they're looking for regardless of which angle they had in mind.
Most English words connect in 7 hops
We analyzed the network structure of the Linguabase and found a "small world" effect: 76% of random English word pairs connect in seven hops or fewer. The average path length is just 6.43 steps.
This is why word association games work. Players feel like they can get from any word to any other—and they're right. The paths exist and they're short. "Batman" to "inspect," "sugar" to "peace"—these seem like stretches, but real paths connect them. English meaning is more tightly connected than people realize.
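The hop counts come from standard shortest-path search over the association graph. A minimal sketch with breadth-first search; the graph below is a toy illustration, not Linguabase data:

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search for the shortest association path."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # no path found

# Toy association graph.
graph = {
    "sugar":  ["sweet", "coffee"],
    "sweet":  ["kind"],
    "kind":   ["peace"],
    "coffee": ["morning"],
}
print(shortest_path(graph, "sugar", "peace"))
# ['sugar', 'sweet', 'kind', 'peace']
```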
Research Timeline
- 2013–2014: Used NSF supercomputer grant for foundational computational linguistics work
- 2017: Started building the word association database, combining algorithmic methods with human curation
- 2023 onward: LLM-assisted evaluation at scale—using LLMs to audit for errors, inappropriate content, and imbalanced word senses
Funding
We funded this ourselves over ten years of development, with help from grants and donated computing resources.
Licensing
We offer two licensing tracks. Data licensing provides raw vocabulary, definitions, associations, and more—delivered as TSV, SQLite, or JSON files for offline use in your app. Puzzle licensing generates custom puzzle content tailored to your game mechanics—you define the rules, we generate thousands of unique levels.
The data specs: 400K production-deployed words (1.5M maintained), 100 million weighted relationships with A–E quality grades, sense-balanced associations, 291K false cognates removed, and two-tier content blocklists. Human-curated, LLM-validated, actively maintained, and tested in shipping games.
Need word data or custom puzzles for your game?
Six data layers. Two licensing tracks. Built for word game developers.
Visit linguabase.org
