Books used to create 'fossil record' of human culture

Researchers have analyzed the text from four percent of all books ever published, to learn about how human culture has changed (Photo: Tom Murphy VII)

You may have Facebook friends who have done the “Here are the top words from my Facebook status messages!” thing, where it lists the words they’ve most commonly used in telling the world about their lives. Interesting as that may or may not be, imagine something similar being done with four percent of all books ever published. That’s what a team of researchers from Harvard University, Google, Encyclopaedia Britannica, and the American Heritage Dictionary has done. The resulting dataset is made up of the full text of about 5.2 million books, 72 percent of that text being in English, with French, Spanish, German, Chinese, Russian, and Hebrew making up the rest. Analyzing that dataset, a practice that the researchers call “culturomics” (a play on genomics), has revealed some fascinating things about the history of our species.

“Now that a significant fraction of the world's books have been digitized, it's possible for computer-aided analysis to reveal undiscovered trends in history, culture, language, and thought,” said Jon Orwant, engineering manager for Google Books.

One of those trends is the adoption of “unsanctioned” new words. It was found that about 8,500 new words enter the English language annually, which caused a 70 percent growth in the lexicon between 1950 and 2000. Approximately 52 percent of that lexicon, however, consists of words not found in dictionaries.

It also turns out that people in 2000 likely cared a lot less about the year 1950 than people in 1950 cared about the year 1900; overall, humanity pays less attention to the past with every passing year. After tracking the frequency at which every year from 1875 to 1975 appeared in text, the team discovered that mentions of the year 1880 didn’t fall by more than half until 32 years later, whereas mentions of the year 1973 fell by half within just a decade.
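To make the “fell by half” measurement concrete, here is a minimal sketch in Python of how such a lag could be computed from a yearly frequency series. The function name and the sample numbers are invented for illustration, chosen only so that they reproduce the 32-year and 10-year lags quoted above; they are not the study's data, and the researchers' actual method may differ.

```python
def years_to_half_peak(freq_by_year):
    """Return the number of years between the peak and the first later year
    whose frequency is at or below half the peak, or None if that never happens."""
    peak_year = max(freq_by_year, key=freq_by_year.get)
    half_peak = freq_by_year[peak_year] / 2
    for year in sorted(y for y in freq_by_year if y > peak_year):
        if freq_by_year[year] <= half_peak:
            return year - peak_year
    return None


# Hypothetical frequencies (arbitrary units), NOT taken from the study.
mentions_of_1880 = {1880: 10.0, 1890: 8.0, 1900: 6.5, 1912: 5.0, 1920: 4.0}
mentions_of_1973 = {1973: 10.0, 1978: 7.0, 1983: 5.0, 1990: 3.0}

print(years_to_half_peak(mentions_of_1880))  # 32
print(years_to_half_peak(mentions_of_1973))  # 10
```

A shorter lag means the year dropped out of print conversation faster, which is the sense in which the past is being “forgotten” sooner.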

Years aren’t the only things that are getting forgotten sooner, either. Celebrities, although they’re now achieving fame at an earlier average age, lose that fame much more quickly than ever before. Culturomics also shows up instances of things that entire nations were made to forget: from 1936 to 1944, mention of Jewish artist Marc Chagall increased fivefold in English text, while he was mentioned just once in German literature. Similar suppression was found with Russian mention of Leon Trotsky, Chinese mention of Tiananmen Square, and American mention of the “Hollywood Ten,” who were blacklisted in 1947.

If you have a word that you’re curious about, you can now go type it into the Books Ngram Viewer, an online interface that Google Labs created based on the study. A graph will instantly show you how the usage frequency of that word has changed over the past several hundred years.
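“Usage frequency” here means a word's share of all the words printed in a given year, rather than a raw count, which keeps years with more published text from dominating the graph. The sketch below shows that normalization in Python; the function names and all of the counts are hypothetical, and it assumes a simple count-divided-by-yearly-total definition rather than reproducing the viewer's actual pipeline.

```python
def relative_frequency(word_counts, total_counts):
    """Divide a word's yearly count by the total number of words printed that year."""
    return {
        year: word_counts.get(year, 0) / total
        for year, total in sorted(total_counts.items())
        if total > 0
    }


def peak_year(freqs):
    """Return the year in which the relative frequency is highest."""
    return max(freqs, key=freqs.get)


# Hypothetical counts for a single word, NOT real data.
word_counts = {1965: 20, 1971: 90, 1980: 30, 2007: 60}
total_counts = {1965: 1_000_000, 1971: 1_500_000, 1980: 2_000_000, 2007: 3_000_000}

freqs = relative_frequency(word_counts, total_counts)
print(peak_year(freqs))  # 1971
```

Note that the raw count in 2007 is double that of 1980, yet the relative frequencies are much closer, because more text was printed in the later year.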

Not surprisingly, it turns out that the word “groovy” peaked in 1971, although it experienced a bit of a resurgence in 2007... any ideas what happened to cause that?

The research was recently published in the journal Science.

6 comments
Hoodoo
"...any ideas what happened to cause that?"
Chalk one up for Austin Powers, baby! Yeah!
Light_Lab
What I find strange is that the words: capacitor, transistor, internet, and computer all have a little peak between 1900 and 1910. Rather than believe in some Edwardian visionary I suspect one book has been incorrectly dated???
The Floridian
I used this new Google service to follow my hunch on the word "Orwellian" between 1950 and 2008 (apparently the most recent available date). It tells the story graphically better than words. I wonder what "double-speak" will look like?
Paul Anthony
I entered robotics in there and got a hit for 1882. I always thought it was in the 1920s that it first appeared. But there it was in black and white. However, as Light_Lab mentioned above, they could have entered a wrong date.
Facebook User
It would be great if this were applied to newspapers, covering today back through the last century, to find relations between propagandist concepts. But for now, software could analyze the (downloadable) raw data to search for other linearities. For example, have a look at the graph found by "bomb,aircraft,propaganda,gasoline,job"
Facebook User
Regarding the early uses of Robot: I did a Google Books search for the word from 1800-1920 and came up with one book that was listed as 1906 but should probably have been from about 1986, judging from some names attached to an article about ping-pong playing robots. Also, there was a listing from 1893 for a Robot Foster Barnard, but when you click through it shows the scanned page and it should be Robert. There was also a book from 1858 talking of Eastern European peasant service called 'robot', which I believe is the origin of the word; there are several other books with a similar usage.
So it's a combination of poor scans, mislabeled copyright years, and the original root.