Buzzword Watch: "Culturomics" and "Ngram"

Last week, Google unveiled an exciting new tool for analyzing the history of language and culture. They call it the "Ngram Viewer," and it's an interface for studying the enormous corpus of historical texts scanned by Google Books. The Ngram Viewer was rolled out in conjunction with a paper in the journal Science introducing the field of "culturomics." Dennis Baron has weighed in on the significance of this development for researchers. But what about those peculiar words, culturomics and ngram?

The authors of the Science paper, "Quantitative Analysis of Culture Using Millions of Digitized Books" (free registration required), define culturomics as "the application of high-throughput data collection and analysis to the study of human culture." The culture part of culturomics is straightforward enough, but what about the -omics? Many observers in the wake of last week's publicity barrage have been stymied by that. The esteemed language expert David Crystal, for instance, initially surmised on his blog that culturomics is "presumably based on ergonomics, economics, and suchlike." Dan Clayton, a British language researcher (and friend of the VT) similarly speculated that the new word is "a blend of culture and economics, with a bit of linguistics thrown in."

Full disclosure: I was lucky enough to get a preview of the Science paper a couple of months ago from a presentation by the lead researchers, the young Harvard scholars Jean-Baptiste Michel and Erez Lieberman-Aiden, so by the time the paper was published last week I had advance warning about culturomics. And I already knew that it was intended to be pronounced with a long "o" (cultur-OH-mics), a clue that it has nothing to do with economics or ergonomics. Rather, the model is genomics: the study of organisms in terms of their full DNA sequences, or genomes.

Further disclosure: among my other comments on their paper presentation, I told Jean-Baptiste and Erez that I didn't think culturomics was the most felicitous choice for the new field of study they envisioned. The connection to genomics might be apparent to those in the biosciences who have already seen the proliferation of other words ending in -omics, such as proteomics, the study of the proteome (the full set of proteins encoded by a genome). This Wikipedia page lists a raft of other -omics topics, such as connectomics, interferomics, and transcriptomics. But despite the large number of -omics coinages in biology and allied sciences, a lay audience would not immediately pick up on the meaning of the suffix, especially if they only see the word in print rather than hearing the tell-tale long "o" sound.

The temptation to read -omics as connected to economics is a strong one, given the widespread use of the -(o)nomics combining form. I wrote about this on OUPblog a few years ago:

As for different flavors of economics, the best-selling book Freakonomics has successfully popularized the title coinage, to stand alongside such others as infonomics, bionomics, and greenomics. But most of the common forms ending in -nomics attach to the names of prominent politicians on the model of Reaganomics (and Nixonomics before it). Hence in the US we get Clintonomics, Bushonomics, Kerrynomics, and Rubinomics (after Clinton's Treasury Secretary Robert Rubin). It's popular outside of the US as well, as illustrated by Rogernomics (after New Zealand Finance Minister Roger Douglas), Thaksinomics (after Thailand's Prime Minister Thaksin Shinawatra), and Manmohanomics (after Indian Prime Minister Manmohan Singh).

Note that in the above cases the -n- of economics always makes an appearance, even if it overlaps with a word ending in n (like Reagan, Clinton, or Rubin). So if we were looking for a shortened form of cultural economics, we'd more likely go with something like culturonomics. Still, perhaps thanks to the success of Freakonomics, the economics connection was many people's default assumption when they first read of culturomics. Despite all this, I think the groundbreaking research on display in the Science paper and the Ngram Viewer is significant enough to overcome this initial flurry of confusion and establish culturomics in the public imagination.

Ngram also likely mystified those who don't know much about computational linguistics. An ngram (more often hyphenated as n-gram) is a sequence of n consecutive words appearing in a given text. When investigating a corpus of texts, the maximum value of n needs to be big enough to provide a window to appreciate the immediate context of analyzed words. In the case of the Google tool, n can range from 1 to 5, so the analysis is based on strings of no more than five words: the "5-grams" in the Declaration of Independence would include "When in the course of," "in the course of human," "the course of human events," and so forth.
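For readers who like to see the idea spelled out, here is a minimal sketch in Python of how the n-grams of a text can be listed. It is only an illustration, not a description of how Google builds its data: the function name ngrams is my own, and splitting the text on whitespace is a simplifying assumption (a real corpus tool needs more careful handling of punctuation and capitalization).

```python
def ngrams(text, n):
    """Return every run of n consecutive words in text, in order.

    Assumption: words are separated by whitespace. This is a
    simplification for illustration, not Google's tokenization.
    """
    words = text.split()
    # Slide a window of n words across the text, one word at a time.
    # If the text has fewer than n words, the result is an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


opening = "When in the course of human events"

# List the 5-grams in the opening words of the Declaration.
for gram in ngrams(opening, 5):
    print(gram)
```

Run on those seven opening words, the sketch yields the same three 5-grams quoted above. Calling ngrams(opening, 1) would simply return the individual words, which is the n = 1 case.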

Google's been tinkering with n-grams for quite a while: they've already released a massive 5-gram dataset to language researchers based on the billions of online texts that their search engine indexes. The new public tool applies that same n-gram approach to a subset of the 15 million or so books that have been scanned by Google and partner libraries. When the Ngram Viewer made its debut on Google Labs last week, most visitors probably ignored the puzzling name (made all the more puzzling by the lack of a hyphen in ngram) and went right ahead generating fun graphs of the change of word usage over the past few centuries. (Check out the Ngrams Tumblr feed for some great examples.)

I've noticed that, because of the opacity of the term n(-)gram, a lot of people are using it to refer to the graphs that Google's tool generates. Thus, if someone says on Twitter, "Check out this cool ngram!" you can guess they're talking about a particular line graph rather than a cool string of words. Semantic change often arises because of misunderstandings, and we may be witnessing a rapid shift in the meaning of n(-)gram from its previous technical sense. As for culturomics? Well, as The New York Times put it last week, "in 20 years, type the word into an updated version of the database and see what happens."

You can hear me talk more about the Ngram Viewer on WNYC's "The Brian Lehrer Show."

Ben Zimmer is language columnist for The Wall Street Journal and former language columnist for The Boston Globe and The New York Times Magazine. He has worked as editor for American dictionaries at Oxford University Press and as a consultant to the Oxford English Dictionary. In addition to his regular "Word Routes" column here, he contributes to the group weblog Language Log. He is also the chair of the New Words Committee of the American Dialect Society.
