You may have Facebook friends who have done the “Here are the top words from my Facebook status messages!” thing, which lists the words they’ve used most often in telling the world about their lives. Interesting as that may or may not be, imagine something similar being done with four percent of all books ever published. That’s what a team of researchers from Harvard University, Google, Encyclopaedia Britannica, and the American Heritage Dictionary has done. The resulting dataset is made up of the full text of about 5.2 million books; 72 percent of that text is in English, with French, German, Chinese, Russian, and Hebrew making up the rest. Analyzing that dataset, a practice the researchers call “culturomics” (a play on genomics), has revealed some fascinating things about the history of our species.
“Now that a significant fraction of the world's books have been digitized, it's possible for computer-aided analysis to reveal undiscovered trends in history, culture, language, and thought,” said Jon Orwant, engineering manager for Google Books.
One of those trends is the adoption of “unsanctioned” new words. The researchers found that about 8,500 new words enter the English language annually, which drove a 70 percent growth in the lexicon between 1950 and 2000. Approximately 52 percent of that lexicon, however, consists of words not found in dictionaries.
It also turns out that people in 2000 likely cared a lot less about the year 1950 than people in 1950 cared about the year 1900: overall, humanity pays less attention to the past with every passing year. After tracking how often every year from 1875 to 1975 appeared in text, the team discovered that mentions of the year 1880 didn’t fall by more than half until 32 years later, whereas mentions of the year 1973 fell by half within just a decade.
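For readers who want to try this kind of measurement themselves, here is a minimal sketch in Python of how such a "forgetting lag" could be computed. It is not the researchers' code: the function name `half_life_lag`, the rule used (first year after the peak in which frequency drops to half the peak or lower, counted from the named year), and the toy numbers are all assumptions made for illustration.

```python
def half_life_lag(mentions, named_year):
    """Years from `named_year` until mentions fall to half of their peak.

    `mentions` maps a calendar year to how often `named_year` was
    mentioned in text published that year (hypothetical data).
    Returns None if mentions never drop to half the peak.
    """
    years = sorted(mentions)
    # Year in which the named year was mentioned most often.
    peak_year = max(years, key=lambda y: mentions[y])
    threshold = mentions[peak_year] / 2
    for year in years:
        if year > peak_year and mentions[year] <= threshold:
            return year - named_year
    return None


# Toy, made-up frequencies for mentions of the year 2000.
toy_mentions = {2000: 1.0, 2001: 4.0, 2002: 3.0, 2003: 2.5, 2004: 1.9, 2005: 1.0}
print(half_life_lag(toy_mentions, 2000))  # 4
```

With real yearly counts, a lag of 32 for 1880 and about 10 for 1973 would correspond to the figures quoted above.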
Years aren’t the only things that are getting forgotten sooner, either. Celebrities, although they’re now achieving fame at an earlier average age, lose that fame much more quickly than ever before. Culturomics also reveals instances of things that entire nations were made to forget: from 1936 to 1944, mentions of Jewish artist Marc Chagall increased fivefold in English text, while he was mentioned just once in German literature. Similar suppression was found with Russian mentions of Leon Trotsky, Chinese mentions of Tiananmen Square, and American mentions of the “Hollywood Ten,” the screenwriters and directors blacklisted in 1947.
If you have a word that you’re curious about, you can now go type it into the Books Ngram Viewer, an online interface that Google Labs created based on the study. A graph will instantly show you how the usage frequency of that word has changed over the past several hundred years.
Not surprisingly, it turns out that the word “groovy” peaked in 1971, although it experienced a bit of a resurgence in 2007... any ideas what happened to cause that?
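If you are wondering what a “usage frequency” per year might look like in practice, the sketch below shows one simple way to compute it in Python: the share of a year's words that match the word you care about. This is an illustrative assumption, not a description of how Google's viewer works internally; the function name `yearly_frequency` and the tiny sample texts are made up.

```python
def yearly_frequency(word, texts_by_year):
    """Map each year to the fraction of that year's words equal to `word`.

    `texts_by_year` maps a year to the text published that year
    (here, short made-up strings split on whitespace).
    """
    target = word.lower()
    frequencies = {}
    for year, text in sorted(texts_by_year.items()):
        tokens = text.lower().split()
        if tokens:  # skip years with no text to avoid dividing by zero
            frequencies[year] = tokens.count(target) / len(tokens)
    return frequencies


# Toy corpus: two "years" of invented text.
toy_corpus = {
    1970: "that band is groovy groovy man",
    1990: "that band is fine",
}
freq = yearly_frequency("groovy", toy_corpus)
print(freq)
```

Dividing by the total number of words each year matters because far more text is published in some years than in others, so raw counts alone would be misleading.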
The research was recently published in the journal Science.