Culturomics research uses quarter-century of media coverage to forecast human behavior

Global geocoded tone of all New York Times content from 1946 (Image: Leetaru)

"Culturomics" is an emerging field of study into human culture that relies on the collection and analysis of large amounts of data. A previous culturomic research effort used Google's culturomic tool to examine a dataset made up of the text of about 5.2 million books to quantify cultural trends across seven languages and three centuries. Now a new research project has used a supercomputer to examine a dataset made up of a quarter-century of worldwide news coverage to forecast and visualize human behavior. Using the tone and location of news coverage, the research was able to retroactively predict the recent Arab Spring and successfully estimate the final location of Osama Bin Laden to within 200 km (124 miles).

The research used the large shared-memory supercomputer called Nautilus, which is part of the National Institute for Computational Sciences (NICS) network of advanced computing resources at Oak Ridge National Laboratory (ORNL) and boasts 1,024 cores and 4 terabytes of global shared memory. The dataset used was formed by combining three massive news archives that totaled more than 100 million articles worldwide. They included the complete New York Times (NYT) from 1945 to 2005, the unclassified edition of the Summary of World Broadcasts (SWB) from 1979 to 2010, and an archive of English-language Google News articles spanning 2006 to 2011. These archives provided a cross-section of the U.S. media spanning half a century and the global media over a quarter-century.

Using this data, study author Kalev Leetaru of the University of Illinois at Urbana-Champaign applied advanced tonal, geographic, and network analysis methods to produce a network 2.4 petabytes in size. It contains more than 10 billion people, places, things, and activities linked by over 100 trillion relationships, providing a cross-section of Earth as seen through the news media. Leetaru let the supercomputer find interesting patterns in the bulk of the data, which he then recreated using a more traditional targeted and smaller-scale approach. In this way, Leetaru was able to produce real-time forecasts of human behavior, such as national conflicts and the movement of specific individuals.

Tone

Leetaru says that examining the tone of a news story is one of the most important aspects of his version of culturomics and the most reliable metric for conflict. He cites the example of the Foreign Broadcast Information Service (FBIS) news-monitoring service, which produced an analytical report on December 6, 1941 - the day before the bombing of Pearl Harbor - noting that the bitterness of Japanese radio broadcasts toward the U.S. had increased and that appeals for peace had ceased. "They recognized the most valuable part about the news was not the factual parts, but the latent parts - the tone, the emotion," said Leetaru.

"Almost every Fortune 500 company monitors the tone of news and social media coverage about their products," Leetaru added. "There's been a huge amount of research coming out of the business literature on the power of news tone to predict economic behavior, yet there hasn't been as much work in using it to predict social behavior."

To create a numeric measurement of overall tone in a document, Leetaru used an algorithm that counts the "positive" and "negative" words that appear and assigns the document a positive or negative value. Working from dictionaries of pre-assigned positive and negative words, he used two tone-mining methods. The first measured the density of positive words and the density of negative words, then subtracted one from the other to get a measure of overall tone. The second method used a dictionary that numerically rated each word from extremely negative to extremely positive and then averaged the scores of all the rated words found in the story for a more nuanced result.
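The two tone-mining methods described above can be sketched roughly as follows. This is a minimal illustration, not the study's implementation: the word lists, the graded scores, and the function names `density_tone` and `graded_tone` are all hypothetical, since the article does not name the dictionaries that were used.

```python
import re

# Hypothetical toy dictionaries; the study's actual word lists are not given here.
POSITIVE_WORDS = {"peace", "calm", "hope", "agreement"}
NEGATIVE_WORDS = {"anger", "attack", "bitter", "war"}

# Hypothetical graded lexicon: -1.0 (extremely negative) to +1.0 (extremely positive).
WORD_SCORES = {
    "peace": 0.8, "hope": 0.6, "calm": 0.4,
    "bitter": -0.5, "anger": -0.7, "attack": -0.9, "war": -1.0,
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def density_tone(text):
    """Method 1: positive-word density minus negative-word density."""
    words = tokenize(text)
    if not words:
        return 0.0
    positive = sum(w in POSITIVE_WORDS for w in words)
    negative = sum(w in NEGATIVE_WORDS for w in words)
    return (positive - negative) / len(words)


def graded_tone(text):
    """Method 2: average score of every word that appears in the graded lexicon."""
    scores = [WORD_SCORES[w] for w in tokenize(text) if w in WORD_SCORES]
    return sum(scores) / len(scores) if scores else 0.0
```

Under these toy dictionaries, a sentence such as "Talks brought hope and calm" scores positive with both methods, while "Anger after the attack" scores negative. The graded method separates mildly and strongly worded stories that the density method would treat alike.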

Location, location, location

Leetaru also used fulltext geocoding to provide an approximate geographic coordinate for locations referenced in a news article and network analysis to show how global media groups countries together in "civilizations." "Using global news coverage, you count how many times every city on Earth is mentioned with every other city in an article," explained Leetaru. "Group those results by country and you have a network of how the world news media relates and frames all the countries on Earth."
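The counting step Leetaru describes can be sketched in a few lines. This is only an illustration under stated assumptions: the `CITY_TO_COUNTRY` gazetteer and the `country_links` function are hypothetical, city detection here is naive substring matching rather than real fulltext geocoding, and dropping same-country pairs is a choice made for this sketch, not something the article specifies.

```python
from collections import Counter
from itertools import combinations

# Hypothetical miniature gazetteer mapping city names to countries.
CITY_TO_COUNTRY = {
    "Cairo": "Egypt",
    "Alexandria": "Egypt",
    "Tunis": "Tunisia",
    "Paris": "France",
}


def country_links(articles):
    """Count city co-mentions per article, then group the counts by country pair."""
    links = Counter()
    for text in articles:
        # Naive detection: a city counts as mentioned if its name appears in the text.
        cities = sorted({city for city in CITY_TO_COUNTRY if city in text})
        for city_a, city_b in combinations(cities, 2):
            pair = tuple(sorted((CITY_TO_COUNTRY[city_a], CITY_TO_COUNTRY[city_b])))
            if pair[0] != pair[1]:  # skip pairs inside one country (assumption)
                links[pair] += 1
    return links
```

The resulting counter is a weighted edge list between countries. Grouping countries into "civilizations" would then require a community-detection step over that network, which is not shown here.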

Comparing the SWB and NYT archives provided insight into how the media of different countries group the world's nations. The SWB news led to seven civilizations, while the NYT archive led to only five, with a greater proportion of countries grouped with the U.S.

World "civilizations" according to SWB, 1979-2009 (Image: Leetaru)

World "civilizations" according to NYT, 1945-2005 (Image: Leetaru)

"Each country's media will depict the world differently," explained Leetaru. "It's a standard principle of journalism - you write for your audience. Still, it vividly reinforces that what we get here in the U.S. is a very U.S. centric view of the world."

Culturomics crystal ball

Using the three key data mining techniques of tone-mining, fulltext geocoding and network analysis, Leetaru was able to produce some interesting results. He says that "pooling together the global tone of all news mentions of a country over time appears to accurately forecast its near-term stability, including predicting the revolutions in Egypt, Tunisia, and Libya, conflict in Serbia, and the stability of Saudi Arabia."

While Leetaru says Tunisia played a huge role in the Egyptian revolution, the real beginnings of the revolt can be traced back to the New Year's Eve bombing of a Coptic church in Alexandria that killed 21 and injured 70. This domestic terrorism attack provoked local anger at the government. The global news media captured both this negative shift in sentiment and the possibility that the bombing, coming on the heels of the Tunisian revolution, could destabilize the country.
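One rough way to picture "pooling together the global tone of all news mentions of a country over time" is to average tone per country per month and ask how unusual the latest month is. The sketch below is an assumption-laden illustration: the `monthly_tone` and `latest_z_score` functions, the monthly bucketing, and the z-score comparison are all choices made here, not the method reported in the study.

```python
from collections import defaultdict
from statistics import mean, pstdev


def monthly_tone(mentions):
    """Average tone per country per month.

    mentions: iterable of (country, "YYYY-MM", tone) tuples.
    """
    buckets = defaultdict(list)
    for country, month, tone in mentions:
        buckets[(country, month)].append(tone)
    series = defaultdict(dict)
    for (country, month), tones in buckets.items():
        series[country][month] = mean(tones)
    return series


def latest_z_score(month_to_tone):
    """How far the latest month sits from the earlier months, in standard deviations.

    Returns None when there is too little history or no variation to compare against.
    """
    months = sorted(month_to_tone)  # "YYYY-MM" strings sort chronologically
    history = [month_to_tone[m] for m in months[:-1]]
    if len(history) < 2:
        return None
    spread = pstdev(history)
    if spread == 0:
        return None
    return (month_to_tone[months[-1]] - mean(history)) / spread
```

A strongly negative score would flag a month whose coverage was far darker than the country's recent baseline. Whether such a flag actually anticipates unrest is the empirical question the study raises, and this sketch does not answer it.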

Not only was Leetaru able to retroactively predict the Arab Spring and dissect the basis for the uprisings, but he was also able to narrow focus and use the news to map the movement of a specific individual - Osama Bin Laden. Although the city of his death, Abbottabad, is only mentioned once in all the articles within the dataset, it is less than 200 km from the two most popular cities associated with him - Islamabad and Peshawar. In fact, nearly 49 percent of all the articles mentioning Bin Laden included a city in Pakistan.

Global geocoded tone of all Summary of World Broadcasts content, January 1979-April 2011 mentioning "bin Laden" (Image: Leetaru)

While Leetaru admits the global news content couldn't provide a definite lock on Bin Laden's location, it suggested that he was almost twice as likely to be found in Pakistan as Afghanistan and that a 200 km radius around Islamabad and Peshawar was his most likely location.
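The 200 km figure is easy to sanity-check with a great-circle distance calculation. The sketch below uses the standard haversine formula; the coordinates are approximate values supplied for illustration and do not come from the study, and the `haversine_km` helper is a name chosen here.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Approximate city coordinates (assumed for this example, not taken from the study).
CITIES = {
    "Abbottabad": (34.17, 73.22),
    "Islamabad": (33.68, 73.05),
    "Peshawar": (34.02, 71.52),
}

for name in ("Islamabad", "Peshawar"):
    distance = haversine_km(*CITIES["Abbottabad"], *CITIES[name])
    print(f"Abbottabad to {name}: {distance:.0f} km")
```

With these coordinates both distances come out well under 200 km, consistent with the radius quoted above.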

"I never expected to pinpoint him so accurately," admitted Leetaru. "But it's fascinating - if you make a map of all the cities mentioned in articles about him over the last decade it leads to a 200-kilometer radius around where he was found. It begs the question, 'Why did that work so well?'"

Although Leetaru says the findings of his study are captivating, his real goal is to encourage further study.

"The purpose of this paper is not to say, 'Here's the magic bullet that solves these problems,' but more as a road map for future research," he said. "I see it as diving beneath the ocean - we've been so focused on the surface that we're only just beginning to start exploring the entire new world that's underneath."

Leetaru's paper, "Culturomics 2.0: Forecasting Large-Scale Human Behavior Using Global News Media Tone in Time and Space," can be read in full in the journal First Monday. The research was funded by the National Science Foundation and managed by the University of Tennessee's Remote Data Analysis and Visualization Center.

9 comments
Doug MacLeod
Scary stuff. Remember Asimov.
Interestingly my wife can do this even more accurately when predicting my behaviour.
Caimbeul
Ain't hindsight wonderful?
jackthedog
Sounds a lot like Isaac Asimov's "Foundation" series. The man was indeed a visionary.
McDesign
Psychohistory, clearly - exactly as forecast in Asimov's Foundation Trilogy.
Charles Bosse
I think it would be more interesting to see a set of short term future predictions (like who will win the Republican primary) than retroactive predictions. Of course, this also raises the question of how much the media is self fulfilling and how much it is democratic.
RAMLOT
He mentions the "unclassified" version of the SWB as a data source.
It would be interesting if he had access to what is classified or to a recording service like Burrelles that has been recording and transcribing broadcast media for decades.
It's not implausible that governments around the world are way ahead of this guy.
danBran
I read the article, and think "reminds me of Asimov's Foundation series" scroll down to see I'm not the only one.
BTW I have been trying to remember if it was in one of the Foundation books where a character had a computer the size of a walnut that projected both the screen and the keyboard for use
Scion
I find it pretty cool that this sort of analysis can lead us to better understand how small pieces might fit together to form a larger whole. Like, what was the tipping point for the Arab Spring? Apparently a bombing of a Coptic church. That sounds useful. The location of Osama Bin Laden? I could have told you it would be Pakistan to within 200km just because, well, he had very few places he could be. Mind you, if I was him I would have shaved my head clean and moved to Fiji, got a tan and drank margaritas, but then I'm not hell bent on causing pain and suffering to as many people as possible, so maybe that's an important distinction?
Oztechi
Overly controlling countries like China, Myanmar/Burma, North Korea and other countries ruled by dictators or single political parties would love to get their hands on this kind of tool along with data sets and algorithms. That way they could theoretically predict uprisings and plan to squash them earlier and not let them grow.
This could also come in handy as a tool to predict war between countries.