In our coverage of the first Republican presidential candidates' debate last month, we used our vocabulary list builder to find the most relevant word each candidate used. It produced some interesting results, creating what we called a "word's-eye" view of what defined each candidate's core message. (Our findings appeared in The Observer, as a listicle in Mental Floss, and as a fun quiz created by The Washington Post.)

Before Republicans take to the stage for the next debate, we want to explain what we mean by "relevance." Understanding this concept sheds light not just on how we're analyzing candidates' speech, but also on how we're pulling vocabulary from academic and literary texts.

Our list builder is a tool that automatically generates vocabulary lists. Users paste in up to 100 pages of any electronic text and, with a few clicks, get a vocabulary list that's ready to learn and share, complete with example sentences drawn from the source text. While building the list, users can decide which words to include. They can make this selection word by word or, to keep it simple, opt for the top 10, 25, or 50 most relevant words. We determine relevance by comparing how often each word appears in the text with how often it appears in the 3.2 billion-word corpus we've built over time and are constantly updating. Users can also sort and select the words on a list in a host of other ways, but relevance is useful because it answers the question, Which words will best help me understand this particular text?
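
For readers who like to see an idea in code, here is a minimal sketch of a relevance sort in that spirit. It is not our production algorithm: the function names, the add-one smoothing, and the simple ratio of text frequency to corpus frequency are all assumptions made for illustration, and the tiny reference corpus is made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def relevance_scores(text, corpus_counts, corpus_total):
    """Score each word by (rate in the text) / (rate in the reference corpus).

    corpus_counts and corpus_total are hypothetical stand-ins for a large
    reference corpus. Add-one smoothing (an assumption of this sketch)
    keeps words missing from the corpus from causing a division by zero.
    """
    words = tokenize(text)
    text_counts = Counter(words)
    text_total = len(words)
    scores = {}
    for word, count in text_counts.items():
        text_rate = count / text_total
        corpus_rate = (corpus_counts.get(word, 0) + 1) / (corpus_total + 1)
        scores[word] = text_rate / corpus_rate
    return scores


def top_relevant(text, corpus_counts, corpus_total, n=10):
    """Return the n highest-scoring words, breaking ties alphabetically."""
    scores = relevance_scores(text, corpus_counts, corpus_total)
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]


# Toy data, invented for this example only.
toy_corpus_counts = {"the": 60000, "whale": 40, "harpoon": 2,
                     "struck": 50, "held": 200}
toy_corpus_total = 1_000_000
sample_text = "The harpoon struck the whale. The harpoon held."

print(top_relevant(sample_text, toy_corpus_counts, toy_corpus_total, n=3))
```

In this toy setup, a word like "the" is frequent in the sample but even more frequent in the reference counts, so it scores low, while a word that is rare in the reference counts and repeated in the sample rises to the top.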

It's the text-focused quality of relevance that allows us to find words that are particularly important to candidates' speech. When crafting talking points, candidates choose their words carefully and make sure to use language the most plain-speaking among us can understand. At the same time, they must make their messages stand out, and use words that are specific enough to communicate what's unique about their point of view. The more specific a concept, the more specific the word the candidate needs to find to describe it. Thus, the most relevant vocabulary also tends to be the word to watch or, if you like, the one word that best encapsulates a candidate's core message.

To see how relevance homes in on a text's core message, check out what we found when we pasted the first few chapters of Moby Dick into our list builder. The top ten most relevant words in the chapters include cannibal, cenotaph, and impenitent, but in the number one spot, we see the word harpoon.

Imagine reading about Ahab's quest to kill the great white whale without knowing what violence the word harpoon entails. Would you guess the word to be a variant of harp, as in the musical instrument? Perhaps the sailors were hoping to strum the whale into a stupor? Or would you guess that the word is a variant of harp the verb, and that the sailors would row over to the whale in longboats and deliver a repetitive, never-ending monologue of complaint such that the whale would dive for deep waters, never to be seen by man again? Both those scenarios would be pretty confusing! 

Let's look at a few more classics to see our list builder's relevance sort in action. When we put Lewis Carroll's Alice in Wonderland, Jane Austen's Pride and Prejudice, F. Scott Fitzgerald's Great Gatsby, and Charles Dickens' Oliver Twist through our list builder, it drills down to one-word versions of those great works of literature.

And while we're on the subject of one-word versions of the classics (for, shall we say, extremely reluctant readers), we have produced the shortest synopsis ever for what might be one of the longest works of great literature ever penned, War and Peace.

The relevance information is only as good as the corpus, and we really stand behind ours. It's a resource we make use of in several other important ways. Besides determining a word's relevance to a given text, our corpus data is essential to running the algorithms that personalize learning in our game, helping us match players to words at their vocabulary level and making sure the game is not too easy or too hard.

Our corpus also provides the data for the frequency information and example sentences that appear on each word's definition page in the Dictionary. Going back to harpoon, our Dictionary page shows that the word is not that rare. You can expect it to appear once in every 2,615 pages of text. 
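
To put the "once in every 2,615 pages" figure in more familiar terms, here is a rough back-of-the-envelope conversion. The 250-words-per-page value is purely an assumption of this sketch (the actual page length behind the Dictionary figure isn't stated here), and the function names are hypothetical.

```python
ASSUMED_WORDS_PER_PAGE = 250  # assumption for illustration, not a published figure


def words_per_occurrence(pages_per_occurrence, words_per_page=ASSUMED_WORDS_PER_PAGE):
    """Convert 'once every N pages' into 'once every M words'."""
    return pages_per_occurrence * words_per_page


def occurrences_per_million(pages_per_occurrence, words_per_page=ASSUMED_WORDS_PER_PAGE):
    """Express the same rate as occurrences per million words of text."""
    return 1_000_000 / words_per_occurrence(pages_per_occurrence, words_per_page)


print(words_per_occurrence(2615))                 # words between appearances
print(round(occurrences_per_million(2615), 2))    # appearances per million words
```

Under that assumed page length, the word would turn up roughly once every 650,000 words. A different page length would scale the result proportionally.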

That helps explain why none of the usage examples appearing on harpoon's definition page, even when restricted to examples drawn from fiction, are taken from Moby Dick.

Sorry, Melville. Words, it turns out, even when linked to great works of literature, tend to take on lives of their own.