Mining Books: Just a Start

Last Thursday, Harvard and Google hosted the largest data release in the history of the humanities, a database of 500 billion words contained in 5.2 million books.

It's just the beginning--maybe 4%-5% of all the books ever written--but it's a heroic beginning.

Here's what I've learned so far:

The mention of men vastly exceeded the mention of women in texts until the 1980s, when women--after lagging for more than two centuries--finally overtook men.

Meanwhile, the mention of "God" fell dramatically from about 1835 until 1925, when it leveled off; it has been relatively stable ever since. Maybe even up a little lately.

And, while the mention in books of steak and sausage is relatively flat, the mention of pizza and pasta is up dramatically in the last half-century, and sushi in the last generation.  Ice cream is habitually on top.

Like I said, the data mining has just begun.  So far we know that women have overtaken men, God is down but hanging in, and ice cream rocks.  


Won't it be great when we start learning things we don't already know!


(For a good summary, see the Boston Globe article by Carolyn Y. Johnson.)
