Books used to create 'fossil record' of human culture

December 18, 2010

Researchers have analyzed the text from four percent of all books ever published, to learn about how human culture has changed (Photo: Tom Murphy VII)


You may have Facebook friends who have done the “Here are the top words from my Facebook status messages!” thing, where it lists the words they’ve most commonly used in telling the world about their lives. Interesting as that may or may not be, imagine something similar being done with four percent of all books ever published. That’s what a team of researchers from Harvard University, Google, Encyclopaedia Britannica, and the American Heritage Dictionary has done. The resulting dataset is made up of the full text of about 5.2 million books. Seventy-two percent of that text is in English, with French, German, Chinese, Russian, and Hebrew making up the rest. Analyzing that dataset, a practice that the researchers call “culturomics” (a play on genomics), has revealed some fascinating things about the history of our species.

“Now that a significant fraction of the world's books have been digitized, it's possible for computer-aided analysis to reveal undiscovered trends in history, culture, language, and thought,” said Jon Orwant, engineering manager for Google Books.

One of those trends is the adoption of “unsanctioned” new words. The researchers found that about 8,500 new words enter the English language annually, driving a 70 percent growth in the lexicon between 1950 and 2000. Approximately 52 percent of that lexicon, however, consists of words not found in dictionaries.

It also turns out that people in 2000 likely cared a lot less about the year 1950 than people in 1950 cared about the year 1900 – overall, humanity pays less attention to the past with every passing year. After tracking the frequency with which every year from 1875 to 1975 appeared in text, the team discovered that mentions of the year 1880 didn’t fall by more than half until 32 years later, whereas mentions of the year 1973 fell by half within just a decade.
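For the curious, that "fell by half" measurement can be sketched in a few lines of Python. This is only an illustration of the idea, not the researchers' actual method: the function name `years_to_half_peak` is hypothetical, and the frequency values are made up for the example rather than taken from the study's data.

```python
def years_to_half_peak(freq_by_year):
    """Return how many years after its peak a term's frequency first
    drops to half the peak value or lower, or None if it never does.

    freq_by_year maps a calendar year to how often the term appeared
    in text published that year (any consistent unit works).
    """
    # The year in which the term was mentioned most often.
    peak_year = max(freq_by_year, key=freq_by_year.get)
    half_peak = freq_by_year[peak_year] / 2

    # Walk forward in time from the peak, looking for the first dip to half.
    for year in sorted(freq_by_year):
        if year > peak_year and freq_by_year[year] <= half_peak:
            return year - peak_year
    return None


# Made-up frequencies for mentions of "1973" (illustrative only).
toy_mentions_of_1973 = {
    1973: 100.0,
    1974: 90.0,
    1975: 80.0,
    1978: 65.0,
    1983: 50.0,
    1990: 30.0,
}

print(years_to_half_peak(toy_mentions_of_1973))  # 10
```

Running the same function over the mention counts for each year from 1875 to 1975 would give one "forgetting time" per year, which is the kind of comparison described above.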

Years aren’t the only things that are getting forgotten sooner, either. Celebrities, although they’re now achieving fame at an earlier average age, lose that fame much more quickly than ever before. Culturomics also reveals instances of things that entire nations were made to forget – from 1936 to 1944, mentions of Jewish artist Marc Chagall increased fivefold in English text, while he was mentioned just once in German literature. Similar suppression was found with Russian mentions of Leon Trotsky, Chinese mentions of Tiananmen Square, and American mentions of the blacklisted “Hollywood Ten” in 1947.

If you have a word that you’re curious about, you can now go type it into the Books Ngram Viewer, an online interface that Google Labs created based on the study. A graph will instantly show you how the usage frequency of that word has changed over the past several hundred years.

Not surprisingly, it turns out that the word “groovy” peaked in 1971, although it experienced a bit of a resurgence in 2007... any ideas what happened to cause that?

The research was recently published in the journal Science.

6 comments

Hoodoo December 19, 2010 02:16 AM

"...any ideas what happened to cause that?"
Chalk one up for Austin Powers, baby! Yeah!

Light_Lab December 19, 2010 09:36 PM

What I find strange is that the words: capacitor, transistor, internet, and computer all have a little peak between 1900 and 1910. Rather than believe in some Edwardian visionary I suspect one book has been incorrectly dated???

The Floridian December 20, 2010 01:01 PM

I used this new Google service to follow my hunch on the word "Orwellian" between 1950 and 2008 (apparently the most recent available date). It tells the story graphically better than words. I wonder what "double-speak" will look like?

Paul Anthony December 20, 2010 02:07 PM

I entered "robotics" in there and got a hit for 1882. I always thought it was in the 1920's that it first appeared. But there it was in black and white. However, as Light_Lab mentioned above, they could have entered a wrong date.

Facebook User December 20, 2010 08:55 PM

It would be great if this were applied to newspapers from the last century up to today, to find relations between propagandist concepts. But for now, software could analyze the (downloadable) raw data to search for other linearities. For example, have a look at the graph produced by "bomb,aircraft,propaganda,gasoline,job".

Facebook User December 23, 2010 03:02 PM

Regarding the early uses of "robot": I did a Google book search for the word from 1800-1920 and came up with one book that was listed as 1906 but should probably have been from about 1986, judging from some names attached to an article about ping-pong playing robots. There was also a listing from 1893 for a "Robot Foster Barnard", but when you click through it shows the scanned page, and it should be "Robert". A book from 1858 talks of Eastern European peasant service called 'robot', which I believe is the origin of the word, and there are several other books with a similar usage.
So it's a combination of poor scans, mislabeled copyright years, and the original root.
