Erez Lieberman Aiden and Jean-Baptiste Michel discuss Google Labs’ Ngram Viewer in their TED talk, “What can you learn from 5 million books?”
Some projections about the future from the webcomic XKCD:
by Juan Carlos Rocha (PhD student at Stockholm Resilience Centre working on Regime Shifts)
An N-gram is a sequence of N tokens in a text, where each token (a 1-gram) is a string of characters uninterrupted by a space. A 1-gram may be a word, a number or a combination of both. The concept of N-grams simplifies the application of statistical methods to assess the frequency of a word or a phrase in a body of text. N-gram statistical analyses have been around for years, but recently Jean-Baptiste Michel and collaborators had the opportunity to apply N-gram text analysis techniques to the massive Google Books collection of digitized books. They analyzed over 5 million documents, which they estimate to be about 4% of all books ever published, and published their work in Science [doi].
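To make the idea concrete, here is a minimal sketch of N-gram extraction and counting. It is an illustration only, not the researchers' pipeline: the function names, the sample sentence and the choice to normalize by the number of same-length N-grams are all assumptions made for this example.

```python
from collections import Counter


def ngrams(text, n):
    """Return all n-grams in text as tuples of n space-separated tokens."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequency(text, phrase):
    """Share of same-length n-grams in text that match phrase.

    Normalizing by the number of n-grams is an assumption of this sketch;
    other normalizations (e.g. by total word count) are equally possible.
    """
    target = tuple(phrase.split())
    grams = ngrams(text, len(target))
    if not grams:
        return 0.0
    return Counter(grams)[target] / len(grams)


# Made-up sample sentence, for illustration only.
sample = "the regime shift followed another regime shift in 1989"
print(ngrams(sample, 2)[:3])
print(ngram_frequency(sample, "regime shift"))
```

Note that a number such as "1989" counts as a token just like a word does, which matches the definition above.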
Exploring huge amounts of text, more than any single person could read, makes it possible to trace the use of words over time. This allows researchers to track the impact of events on word use and even the evolution of language, grammar and culture. For example, by counting the words used in English books, the team found that in the year 2000 the English lexicon had over one million words, and that it has been growing by about 8,500 words per year. Similarly, they were able to track word fads, for example the changes in the regular or irregular forms of verb conjugations over time (e.g. burned vs burnt). More interestingly, based on particular events and famous names, they identified that our collective memory, as recorded in books, has both a short-term and a long-term component. We are forgetting our past faster than before, but we are also learning faster when it comes to, for example, the adoption of technologies.
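Tracking a word over time boils down to computing its relative frequency per publication year. The sketch below shows the idea for “burned” vs “burnt”. The corpus is a handful of invented sentences with invented years, so the numbers carry no meaning, and `yearly_frequency` is a hypothetical helper rather than anything from the published study.

```python
from collections import defaultdict

# Hypothetical toy corpus of (publication year, text) pairs. Not real data.
TOY_CORPUS = [
    (1900, "the candle burnt low and the toast burnt too"),
    (1900, "he burned the letter"),
    (2000, "she burned the toast and burned the letter"),
    (2000, "the candle burnt out"),
]


def yearly_frequency(corpus, word):
    """Map each year to (occurrences of word) / (total words that year)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in corpus:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(word)
    return {year: hits[year] / totals[year] for year in sorted(totals)}


for form in ("burned", "burnt"):
    print(form, yearly_frequency(TOY_CORPUS, form))
```

Dividing by the total number of words in each year matters because the number of books published varies enormously between years; raw counts alone would mostly reflect the size of the corpus.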
The options for reading books with machine eyes do not end there. Censorship during the German Nazi regime was identified by comparing the frequency of authors’ names in the German and English corpora. The researchers could detect a fingerprint of the suppression of a person’s ideas in the language corpus.
The researchers call this quantitative analysis of our historical knowledge and culture through huge amounts of data “culturomics”. They plan further research that will incorporate newspapers, manuscripts, artwork, maps and other human creations. Possible future applications are the development of methods for historical epidemiology (e.g. influenza peaks), the analysis of conflicts and wars, the evolution of ideas (e.g. feminism), and, I think, why not ecological regime shifts?
Above you can see the frequency of some of the regime shifts we are working with in the English corpus. Soil salinization and lake eutrophication appear in the 1940s and 1960s respectively, probably with the first descriptions of such shifts. Similarly, coral bleaching takes off during the 1980s, when reef degradation in the Caribbean basin began to be documented. The concept of regime shift has also been used more and more since the 1980s, probably not only to describe ecological shifts but also political and managerial transitions.
Although the data may be noisy, the frequency of shock events may be tracked as well. Here, for example, we plot “oil spill” and see the peak corresponding to the case of January 1989 in Floreffe, Pennsylvania. Note that it does not show the oil spill in the Gulf of Mexico last year because the database only extends to 2008.
Google Creative Lab has collaborated with the Montreal band Arcade Fire to create an interactive web movie, “The Wilderness Downtown”, using Google Earth. Director Chris Milk combines the nostalgia of the new Arcade Fire song “We Used to Wait” with Google Maps and Street View images of the streets where the viewer lived to produce a very impressive combination of art and technology.
Wired’s Epicenter blog has an article that gives some background on the project:
“The project came about one day when [director] Chris Milk and I were talking about Chrome Experiments and what can be achieved through a modern web browser and with the power of HTML5 technology,” said Google Creative Lab tech lead and co-creator of the project Aaron Koblin. “We were excited about breaking out of the traditional 4:3 or 16:9 video box, and thinking about how we could take over the whole browser experience. Further, we wanted to make something that used the power of being connected. In contrast to a traditional experience of downloading a pre-packaged video or playing a DVD, we wanted to make something that was incorporating data feeds on the fly, and tailoring the experience to a specific individual.
“One of the biggest struggles for a director is to successfully create a sense of empathy with their characters and settings. Using Google Maps and Street View we’re able to tailor the experience to each person. This effect is a totally different kind of emotional engagement that is both narrative and personally driven.”
…“Experiences” such as this will evolve to look much slicker in the future, but already, they’re capable of some fairly incredible maneuvers, integrating Arcade Fire’s stirring music with data from Google Maps and Google Street View, topping it all off with input from the user.
We’re impressed, but some streamlining will be required if bands that aren’t big enough to play Madison Square Garden, as Arcade Fire is, are going to be able to offer it. We counted a full 111 names in the credits.
Google and renewable energy? Hackers, deforestation and carbon emission rights? This might sound like an odd mix of events, but something is definitely in the pipeline. Global environmental change and rapid information technological change have for a long time been viewed as parallel and decoupled global phenomena. A number of events in the last month indicate that this is likely to change. Just consider the following events:
Internet giant Google recently received approval in the US to buy and sell energy. This follows the company’s stated ambition to become one of the major players in renewable energy. According to the New York Times: “The company’s Green Energy Czar Bill Weihl said the company was fully committed to accelerating the development of renewable energy technologies that can prove more cost-effective than coal power, as a means of both curbing carbon emissions and trimming its own giant energy bill”.
In addition, computer hackers seem to have found a new pool of resources to steal from: emissions trading. As reported by Wired recently, hackers have been successful in stealing millions of dollars by launching “a targeted phishing attack against employees of numerous companies in Europe, New Zealand and Japan, which appeared to come from the German Emissions Trading Authority”. A similar attack reportedly took place in Brazil in December 2008, when hackers managed to get into the government logging databases. The impacts? The illegal harvest of 1.7 million cubic meters of timber, according to Wired.
One final example is of course the ongoing bashing of the IPCC, and the now infamous e-mail hack of UK climate scientists. An interesting follow-up is this op-ed in The Australian, arguing that the Internet is allowing climate change skeptics to gain traction. One of the more thought-provoking quotes from the article states:
The ‘climate consensus’ may hold the establishment – the universities, the media, big business, government – but it is losing the jungles of the web. After all, getting research grants, doing pieces to camera and advising boards takes time. The very ostracism the sceptics suffered has left them free to do their digging untroubled by grant applications and invitations to Stockholm.
See also John Bruno of climateshifts.org, who asks “Who is orchestrating the cyber-bullying?”.
Are we moving into an era of cyber-environmental politics? I’m pretty sure that we are.
The text message from the elephant flashed across Richard Lesowapir’s screen: Kimani was heading for neighboring farms.
The huge bull elephant had a long history of raiding villagers’ crops during the harvest, sometimes wiping out six months of income at a time. But this time a mobile phone card inserted in his collar sent rangers a text message. Lesowapir, an armed guard and a driver arrived in a jeep bristling with spotlights to frighten Kimani back into the Ol Pejeta conservancy.
Kenya is the first country to try elephant texting as a way to protect both a growing human population and the wild animals that now have less room to roam. …
The race to save Kimani began two years ago. The Kenya Wildlife Service had already reluctantly shot five elephants from the conservancy who refused to stop crop-raiding, and Kimani was the last of the regular raiders. The Save the Elephants group wanted to see if he could break the habit.
So they placed a mobile phone SIM card in Kimani’s collar, then set up a virtual “geofence” using a global positioning system that mirrored the conservancy’s boundaries. Whenever Kimani approaches the virtual fence, his collar texts rangers.
They have intercepted Kimani 15 times since the project began. Once almost a nightly raider, he last went near a farmer’s field four months ago.
It’s a huge relief to the small farmers who rely on their crops for food and cash for school fees. Basila Mwasu, a 31-year-old mother of two, lives a stone’s throw from the conservancy fence. She and her neighbors used to drum through the night on pots and pans in front of flaming bonfires to try to frighten the elephants away.
…the experiment with Kimani has been a success, and last month another geofence was set up in another part of the country for an elephant known as Mountain Bull. Moses Litoroh, the coordinator of Kenya Wildlife Service’s elephant program, hopes the project might help resolve some of the 1,300 complaints the Service receives every year over crop raiding.
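For readers curious how a geofence like the one described above might work in principle, here is a minimal sketch. It is not the system used by Save the Elephants or the Kenya Wildlife Service: the square fence, the planar kilometre coordinates, the alert margin and all function names are assumptions for illustration. A real deployment would first project GPS latitude and longitude into such local coordinates.

```python
import math


def point_in_polygon(point, polygon):
    """Ray-casting test: True if point (x, y) lies inside polygon."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def distance_to_boundary(point, polygon):
    """Shortest planar distance from point to any edge of polygon."""
    px, py = point
    best = math.inf
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        dx, dy = x2 - x1, y2 - y1
        length_sq = dx * dx + dy * dy
        # Project the point onto the edge, clamped to the segment ends.
        t = 0.0
        if length_sq > 0:
            t = max(0.0, min(1.0, ((px - x1) * dx + (py - y1) * dy) / length_sq))
        best = min(best, math.hypot(px - (x1 + t * dx), py - (y1 + t * dy)))
    return best


def should_alert(point, polygon, margin):
    """Alert if the animal is outside the fence or within margin of it."""
    outside = not point_in_polygon(point, polygon)
    return outside or distance_to_boundary(point, polygon) <= margin


# Hypothetical 10 km x 10 km fence in local planar coordinates (km).
FENCE = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]

for position in [(5.0, 5.0), (9.5, 5.0), (12.0, 5.0)]:
    print(position, should_alert(position, FENCE, margin=1.0))
```

The margin is what lets the rangers react before the animal actually crosses the boundary: the alert fires as soon as a position fix falls within that distance of the fence.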