Our language database

The IDEA Linguabase is a large lexicon for use in consumer-facing and natural language processing applications. Built over a period of four years by our team of programmers and lexicographers, it contains definitions and weighted word relations for over 500,000 terms, along with a data graph of over 50 million word associations.

The database suggests related words for traditional thesaurus topics, as well as hundreds of thousands of terms that are omitted from typical thesauri. It covers adjectives, such as “happy,” “joyful,” and “cheerful,” along with over 300,000 nouns, such as “golf club,” “iron,” and “brassie.” It contains closely similar words, akin to synonyms and near-synonyms – think “house,” “domicile,” and “lodge” – items of the same type – “house,” “bungalow,” “villa” – and more distantly associated words: “house,” “quarter,” “dwell.” Every relation carries a decimal weight, from 1 for very similar words down to 0 for a weak, low-confidence association.
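For illustration only, here is a minimal Python sketch of how such a weighted relation might be represented and queried; the class, field names, and sample weights are hypothetical and are not Linguabase's actual schema.

    # Illustrative sketch only: field names and weights are hypothetical,
    # not Linguabase's actual schema.
    from dataclasses import dataclass

    @dataclass
    class WordRelation:
        source: str    # head term
        target: str    # related term
        weight: float  # 1.0 = very similar, values near 0.0 = weak association

    relations = [
        WordRelation("house", "domicile", 0.95),  # near-synonym
        WordRelation("house", "bungalow", 0.70),  # same type of thing
        WordRelation("house", "quarter", 0.20),   # distant association
    ]

    # Related terms for "house", strongest first
    for r in sorted(relations, key=lambda r: r.weight, reverse=True):
        print(f"{r.target}\t{r.weight:.2f}")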

The motivation for Linguabase was the lack of an existing database that met the requirements for our language app projects. Describing English is notoriously expensive, requiring massive amounts of labor from highly educated, specialized talent. In 1985, Princeton began creating WordNet, an influential, large-scale open-source language database. This electronic reference was first published in 1991. WordNet and related projects, like FrameNet and VerbNet, are a mainstay of natural language processing research. While WordNet is included in the IDEA Linguabase, it is of limited use on its own, as it is designed to group words into sets of synonymous terms rather than to act as a thesaurus.

Project activities

The IDEA Linguabase combines several publicly available sources in a unique way, and adds our own lexicographic work.

Some steps in our process required an enormous amount of computing time and power. A single desktop computer could process each pool of text in 5 to 10 minutes, not including testing and refinements; it would have taken over a decade of computing time to analyze all the words. To accelerate this process, we used hundreds of thousands of hours of supercomputer power from the NSF-funded Extreme Science and Engineering Discovery Environment (XSEDE), grant #IRI130011. Using XSEDE supercomputers, we were able to spread this workload across thousands of processing cores, yielding over 30 million ranked word relationships in a matter of days.
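As an illustration of the shape of that workload, here is a minimal Python sketch of the fan-out pattern, using the standard multiprocessing module in place of an actual batch scheduler; the function body and term list are placeholders, not our production pipeline.

    # Minimal sketch of the fan-out pattern: each term's pool of sentences is
    # analyzed independently, so the work parallelizes cleanly across cores.
    # On supercomputer-scale hardware this would be dispatched through a batch
    # scheduler; multiprocessing here just illustrates the shape of the job.
    from multiprocessing import Pool

    def analyze_term(term):
        # Placeholder for the real per-term work: gather the sentence pool,
        # run topic modeling, and return ranked related words.
        return term, ["..."]

    if __name__ == "__main__":
        terms = ["house", "horse", "golf club"]   # in practice, hundreds of thousands of terms
        with Pool() as pool:
            for term, related in pool.imap_unordered(analyze_term, terms):
                print(term, related)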

The heart of the database is a word list, with definitions, based on crowdsourced content. We included words, compound words, and idioms from Wiktionary, as well as major encyclopedic terms from Wikipedia, to create our unabridged dictionary.
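As a rough illustration of the kind of extraction involved, the following Python sketch pulls entry titles from a MediaWiki XML dump such as Wiktionary's; the file path and namespace handling are simplified assumptions, not our actual ingestion code.

    # Rough sketch only: extracting entry titles (headwords) from a MediaWiki
    # XML dump. Namespace handling and filtering are simplified.
    import xml.etree.ElementTree as ET

    def headwords(dump_path):
        for _, elem in ET.iterparse(dump_path, events=("end",)):
            if elem.tag.endswith("}page"):
                title, ns = None, None
                for child in elem:
                    if child.tag.endswith("}title"):
                        title = child.text
                    elif child.tag.endswith("}ns"):
                        ns = child.text
                if ns == "0" and title:       # namespace 0 holds dictionary entries
                    yield title
                elem.clear()                  # keep memory use flat on huge dumps

    # Usage (path is a placeholder):
    # for word in headwords("enwiktionary-latest-pages-articles.xml"):
    #     print(word)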

We then analyzed several dozen free, open-source, and commercial thesauri, including WordNet, the NASA Thesaurus, and data from the National Library of Medicine and the Library of Congress. These sources helped us find over a million word relationships.
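Conceptually, combining those sources looks something like the following Python sketch, where agreement across thesauri raises confidence in a relation; the source names, word pairs, and scoring are purely illustrative.

    # Hypothetical merge step: each source thesaurus contributes candidate
    # (term, related term) pairs, and agreement across sources raises our
    # confidence in the relation. Names and pairs are illustrative.
    from collections import defaultdict

    source_relations = {
        "wordnet":        {("house", "domicile"), ("cat", "feline")},
        "nasa_thesaurus": {("orbit", "trajectory")},
        "medical":        {("cat", "feline"), ("femur", "thighbone")},
    }

    votes = defaultdict(set)
    for source, pairs in source_relations.items():
        for pair in pairs:
            votes[pair].add(source)

    for (a, b), sources in votes.items():
        confidence = len(sources) / len(source_relations)
        print(f"{a} -> {b}: seen in {len(sources)} source(s), score {confidence:.2f}")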

We sought to capture a foundation of broader associations, like the connection between “horse” and “stable” or “cat” and “meow.” To do this, we built a large corpus of English prose from multiple genres. For each of our terms, we extracted a pool of matching sentences and paragraphs. We then used topic modeling to discover abstract topics in collections of text. These topic models examined the statistics of words in each collection, revealing clusters of words likely to appear together.
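As a small example of what a topic-modeling pass looks like, the following sketch uses scikit-learn's LDA on a toy corpus; our actual corpus, tooling, and parameters are not shown here.

    # Sketch of a topic-modeling step using scikit-learn's LDA on a toy corpus.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    documents = [
        "the horse was led back to the stable after the race",
        "the cat curled up and began to meow at the door",
        "a jockey rode the horse across the muddy track",
    ]

    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(documents)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(counts)

    # Top words per topic: words weighted highly in the same topic tend to
    # co-occur, which is the signal used to propose broader associations.
    vocab = vectorizer.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top = [vocab[j] for j in topic.argsort()[::-1][:5]]
        print(f"topic {i}: {top}")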

In addition to natural language processing, we conducted new lexicographic work focused on cultural expansiveness. We defined thousands of groups of related terms, from denominations of Christianity to human bones to high-pitched sounds, that go beyond synonyms. We identified thousands of definitions and relationships for the most common words in English (so-called stopwords), which are typically omitted from word databases. We made a comprehensive list of vulgar and offensive terms, and identified thousands of antonym pairs. We subjected our work to intense editorial review to catch errors from the natural language processing, such as those caused by compound words (e.g., “New York,” which was broken into “New” and “York”) or errors from faulty optical character recognition. This work expanded the word relationships for more than 80,000 terms, while also fixing thousands of errors.
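One simple way to catch compound-word splits like “New York,” shown purely for illustration, is to flag bigrams that co-occur far more often than chance; the thresholds and tiny sample text below are made up.

    # Simple sketch of catching compound-word splits: if a bigram such as
    # ("new", "york") occurs far more often than chance, treat it as one term
    # instead of two. Thresholds and sample text are illustrative.
    from collections import Counter

    tokens = "the new york times reported from new york city".split()

    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    for (a, b), n in bigrams.items():
        # crude association score: observed vs. expected co-occurrence
        score = (n * total) / (unigrams[a] * unigrams[b])
        if n >= 2 and score > 2.0:
            print(f"likely compound: {a} {b} (score {score:.1f})")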

In order to identify words that are often used together (so-called collocations), we built on the data provided by Google’s NGrams project, which is their analysis of data from over 5 million scanned books. We identified common usages and provided words that typically precede and follow a given term.
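The following sketch shows the general idea of turning bigram counts into words that typically precede or follow a term; the counts are invented stand-ins for aggregated Google Books NGram data, and the aggregation is simplified.

    # Sketch of deriving "words that typically precede or follow" a term from
    # bigram counts. The numbers below are invented, not real NGram data.
    from collections import defaultdict

    bigram_counts = {
        ("strong", "coffee"): 9_200,
        ("black", "coffee"): 14_500,
        ("coffee", "table"): 21_000,
        ("coffee", "break"): 11_300,
    }

    before = defaultdict(int)
    after = defaultdict(int)
    for (a, b), n in bigram_counts.items():
        if b == "coffee":
            before[a] += n
        if a == "coffee":
            after[b] += n

    print("often precedes 'coffee':", sorted(before, key=before.get, reverse=True))
    print("often follows 'coffee': ", sorted(after, key=after.get, reverse=True))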

We analyzed phonetic sounds to produce rhymes, identify word families, and generate various forms of wordplay, such as words with common starting or ending letters, words contained within or containing other words, and words with curious letter patterns.
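As one example of rhyme detection from phonetic data, the sketch below uses the CMU Pronouncing Dictionary via NLTK and treats two words as rhyming when their pronunciations match from the last stressed vowel onward; our own phonetic analysis may differ.

    # Sketch of rhyme detection with the CMU Pronouncing Dictionary (via NLTK).
    # Two words rhyme here if their pronunciations match from the last
    # primary-stressed vowel onward.
    import nltk
    from nltk.corpus import cmudict

    nltk.download("cmudict", quiet=True)
    prons = cmudict.dict()

    def rhyme_part(word):
        phones = prons[word.lower()][0]
        # find the last phoneme carrying primary stress ("1")
        for i in range(len(phones) - 1, -1, -1):
            if phones[i].endswith("1"):
                return tuple(phones[i:])
        return tuple(phones)

    print(rhyme_part("house") == rhyme_part("mouse"))   # True
    print(rhyme_part("house") == rhyme_part("horse"))   # False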

Finally, we used artificial intelligence to fine-tune the data and to create validation datasets for building on current and future generations of generative AI.

Our research created the Linguabase engine for our language-related mobile apps, but we are also interested in ideas that other developers or publishers may have for our database.