
What We Learned from 5 Million Books

Love English 2 2022-12-23

Love English 2: helping everyone enjoy learning English!
Speakers: Jean-Baptiste Michel + Erez Lieberman Aiden
Talk title: What we learned from 5 million books
Erez Lieberman Aiden: Everyone knows that a picture is worth a thousand words. But we at Harvard were wondering if this was really true. (Laughter) So we assembled a team of experts, spanning Harvard, MIT, The American Heritage Dictionary, The Encyclopedia Britannica and even our proud sponsors, the Google. And we cogitated about this for about four years. And we came to a startling conclusion. Ladies and gentlemen, a picture is not worth a thousand words. In fact, we found some pictures that are worth 500 billion words.
Erez Lieberman Aiden:人说一幅画面抵过一千个词,但是我们在哈佛大学却在思考这是不是一定正确。我们召集了各方专家,他们来自哈佛,麻省理工,《英国大百科全书》,《美国传统英语字典》,还有我们骄傲的赞助商,谷歌。我们思考了大概四年,最后得出一个惊人的结论。女士们先生们,一幅画面可不止一千个词那么简单。事实上我们发现有时候一幅画面抵过5千亿个词。
 
Jean-Baptiste Michel: So how did we get to this conclusion? So Erez and I were thinking about ways to get a big picture of human culture and human history: change over time. So many books actually have been written over the years. So we were thinking, well the best way to learn from them is to read all of these millions of books. Now of course, if there's a scale for how awesome that is, that has to rank extremely, extremely high. Now the problem is there's an X-axis for that, which is the practical axis. This is very, very low.
Jean-Baptiste Michel: 我们是如何得出这个结论的呢?是这样的,Erez和我在想怎样找到一幅展现人类文明和人文历史的画面:历史的变迁,人们在漫长岁月中写了很多书。所以我们想向他们学习的最佳方法,就是把那几百万本书全都读完。当然,如果用坐标来表示这样做的好处,那Y轴上的值一定是极高的。但问题是还有X轴,也就是可行性,这是极低的。
 
Now people tend to use an alternative approach, which is to take a few sources and read them very carefully. This is extremely practical, but not so awesome. What you really want to do is to get to the awesome yet practical part of this space. So it turns out there was a company across the river called Google who had started a digitization project a few years back that might just enable this approach. They have digitized millions of books. So what that means is, one could use computational methods to read all of the books in a click of a button. That's very practical and extremely awesome.
现在人们倾向于另一种做法,那就是选择几本书进行精读。可行性极高但还不够好。人们真正想要的是一个既好又可行的方法。结果,在水一方,有一家叫“谷歌”的公司,他们在此之前的几年前就开始了一个数字化工程,有可能帮我们找到这个“既好又可行”的方法。他们已经将几百万本书进行了数字化,这就意味着人们在电脑上点几个键就能阅读所有的书。这真的是既可行又好。
 
ELA: Let me tell you a little bit about where books come from. Since time immemorial, there have been authors. These authors have been striving to write books. And this became considerably easier with the development of the printing press some centuries ago. Since then, the authors have won on 129 million distinct occasions, publishing books. Now if those books are not lost to history, then they are somewhere in a library, and many of those books have been getting retrieved from the libraries and digitized by Google, which has scanned 15 million books to date.
这些书是哪里来的呢?从古时候开始人们就开始写作了,这些作家写书都非常卖力。几个世纪前印刷机问世了,写书的过程变得简单多了。自那以后作家们已经出版了1.29亿本书。如果这些书没有随年月而遗失,就都在图书馆里存着。谷歌已经把许多书从图书馆中调了出来,进行了数字化。被扫描的书籍到目前已有1500万册。
 
Now when Google digitizes a book, they put it into a really nice format. Now we've got the data, plus we have metadata. We have information about things like where was it published, who was the author, when was it published. And what we do is go through all of those records and exclude everything that's not the highest quality data. What we're left with is a collection of five million books, 500 billion words, a string of characters a thousand times longer than the human genome -- a text which, when written out, would stretch from here to the Moon and back 10 times over -- a veritable shard of our cultural genome.
谷歌扫描图书时把书的格式做得很好。现在我们不但有了数据,还有元数据,我们掌握了这些书的出版地,作者,出版时间等信息。接下来,我们就要从所有这些记录中筛选出质量最高的数据。最后剩下的是5百万本书,5000亿个词,这么多词连起来长度是人类基因组的1000倍。如果把这些词连续写出来其长度相当于在地月之间往返10次以上,这还仅是我们文化基因组的小小一段。
 
Of course what we did when faced with such outrageous hyperbole ... (Laughter) was what any self-respecting researchers would have done. We took a page out of XKCD, and we said, "Stand back. We're going to try science."
当然啦,面对如此令人崩溃的结果,我们做了一个懂得自重的研究者应该做的事。我们借鉴了XKCD(科学漫画)说:“往后站。我们要用科学来解决问题。”
 
JM: Now of course, we were thinking, well let's just first put the data out there for people to do science to it. Now we're thinking, what data can we release? Well of course, you want to take the books and release the full text of these five million books. Now Google, and Jon Orwant in particular, told us a little equation that we should learn. So you have five million, that is, five million authors and five million plaintiffs is a massive lawsuit. So, although that would be really, really awesome, again, that's extremely, extremely impractical. (Laughter)
当然 这时我们在想,何不先把数据放上去,让人们通过科学来运用数据。现在我们在思考哪些数据可以公开。你当然想把这所有5百万本书全文公开。现在谷歌,具体地说是乔恩. 奥温特教给我们一个有用的方程式。你有5百万本书,那就有五百万个作者。一个有5百万个原告的官司可不小啊!所以尽管这是个好想法,但是也极不现实。
 
Now again, we kind of caved in, and we did the very practical approach, which was a bit less awesome. We said, well instead of releasing the full text, we're going to release statistics about the books. So take for instance "A gleam of happiness." It's four words; we call that a four-gram. We're going to tell you how many times a particular four-gram appeared in books in 1801, 1802, 1803, all the way up to 2008. That gives us a time series of how frequently this particular sentence was used over time. We do that for all the words and phrases that appear in those books, and that gives us a big table of two billion lines that tell us about the way culture has been changing.
现在我们做出些许让步,采用一个非常可行但稍微没那么好的方法。我们不公开全书内容,而是公开书本的相关统计数据,拿“A gleam of happiness”这个词组做例子,它有四个单词,我们称它为四字格。我们会告诉你直到2008年出版的书中,在1801年,1802年,1803年一直到2008年,某个四字格一共出现了多少次。这让我们看到这个词组在这段时期内被使用的频率。我们对在这些书中的所有单词和词组都这么处理。于是我们得出了一个由20亿曲线,表示出文化变化的情况。
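The per-year n-gram table described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline the speakers or Google actually used: the function names, the `(year, text)` input format, and the naive whitespace tokenizer are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

def yearly_ngram_counts(books, n):
    """Map each n-gram to a {year: count} time series.

    `books` is an iterable of (year, text) pairs (a hypothetical format).
    """
    series = defaultdict(Counter)
    for year, text in books:
        tokens = text.lower().split()  # naive tokenizer, an assumption
        for gram in ngrams(tokens, n):
            series[" ".join(gram)][year] += 1
    return series

# Toy corpus, invented for illustration only.
books = [
    (1801, "A gleam of happiness lit the room"),
    (1801, "she felt a gleam of happiness"),
    (1802, "a gleam of happiness is rare"),
]
counts = yearly_ngram_counts(books, 4)
print(dict(counts["a gleam of happiness"]))  # {1801: 2, 1802: 1}
```

Each key in `counts` corresponds to one row of the table the speakers describe: a phrase plus how often it appeared in each year.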
 
ELA: So those two billion lines, we call them two billion n-grams. What do they tell us? Well the individual n-grams measure cultural trends. Let me give you an example. Let's suppose that I am thriving, then tomorrow I want to tell you about how well I did. And so I might say, "Yesterday, I throve." Alternatively, I could say, "Yesterday, I thrived." Well which one should I use? How to know?
这20亿条曲线,我们称作20亿个n字格,它们告诉了我们什么?这些n字格衡量的是文化的走势。我来举个例子,假设我正在发财,明天我告诉你我发财的情况,我会说:“昨天,我发了。”也可以说:“昨天,我发财了。”我到底应该用哪个说法呢?怎么找答案?
 
As of about six months ago, the state of the art in this field is that you would, for instance, go up to the following psychologist with fabulous hair, and you'd say, "Steve, you're an expert on the irregular verbs. What should I do?" And he'd tell you, "Well most people say thrived, but some people say throve." And you also knew, more or less, that if you were to go back in time 200 years and ask the following statesman with equally fabulous hair, (Laughter)
6个月以前很流行的做法是,比如说你去问这位秀发飘逸的心理学家,你说,“史蒂夫,你是不规则动词的专家。我该怎么办啊?”他会说:“大多数人说‘发财了’,但有些人说‘发了’。”如果你可以回到200年前,问问这位秀发同样飘逸的政治家。
 
"Tom, what should I say?" He'd say, "Well, in my day, most people throve, but some thrived." So now what I'm just going to show you is raw data. Two rows from this table of two billion entries. What you're seeing is year by year frequency of "thrived" and "throve" over time. Now this is just two out of two billion rows. So the entire data set is a billion times more awesome than this slide.
“托马斯,我该怎么说?”他会回答:“嗯,在我的时代,大多数人说‘发了’,但是少数人说‘发财了’。”现在我给你们看一个原始数据。这是20亿本书中的其中两本书的曲线。你们将看到“发了”和“发财了”这两个词随时间的推移被使用的频率。这还只是20亿条曲线中的其中两条,整套数据比这张幻灯片要宏伟10亿倍。
 
JM: Now there are many other pictures that are worth 500 billion words. For instance, this one. If you just take influenza, you will see peaks at the time where you knew big flu epidemics were killing people around the globe.
很多画面都相当于5千亿个词,比如这一幅,如果你找“流行感冒”这一词,你会看到几个全球范围内祸害人命的流感高峰。
 
ELA: If you were not yet convinced, sea levels are rising, so is atmospheric CO2 and global temperature.
如果这不足以令人信服,海平面正在上升,大气中二氧化碳含量和全球气温都在升高。
 
JM: You might also want to have a look at this particular n-gram, and that's to tell Nietzsche that God is not dead, although you might agree that he might need a better publicist.
你们也可以看看这个n字格,告诉尼采上帝没死,你可能也认为他或许要换一个企宣了。
 
ELA: You can get at some pretty abstract concepts with this sort of thing. For instance, let me tell you the history of the year 1950. Pretty much for the vast majority of history, no one gave a damn about 1950. In 1700, in 1800, in 1900, no one cared. Through the 30s and 40s, no one cared. Suddenly, in the mid-40s, there started to be a buzz. People realized that 1950 was going to happen, and it could be big.
你可以通过这个得到非常抽象的概念。我跟你们说说1950年的历史,在漫漫历史长河中几乎没人在意1950年,1700年,1800年,1900年。没有人在意20世纪三十年代和四十年代,没有人在意。到了四十年代中期突然间关注度飞升,人们意识到1950年快来了。这一年可能非同小可啊!
 
But nothing got people interested in 1950 like the year 1950. (Laughter) People were walking around obsessed. They couldn't stop talking about all the things they did in 1950, all the things they were planning to do in 1950, all the dreams of what they wanted to accomplish in 1950. In fact, 1950 was so fascinating that for years thereafter, people just kept talking about all the amazing things that happened, in '51, '52, '53. Finally in 1954, someone woke up and realized that 1950 had gotten somewhat passé. (Laughter) And just like that, the bubble burst.
1950年正如人们想象的一样,没发生任何有意思的事情。人们都着了魔了,无时无刻不在谈论他们1950年做过的事情。他们打算在1950年做的事情,或者他们1950年想要实现的梦想。事实上1950年是不同凡响的一年,即使过了好多年人们还是不停地谈论那年发生的所有美好事情。51年,52年,53年,终于到了1954年,人们醒悟过来。1950年已成往事了。就这样,泡泡破了。
 
And the story of 1950 is the story of every year that we have on record, with a little twist, because now we've got these nice charts. And because we have these nice charts, we can measure things. We can say, "Well how fast does the bubble burst?" And it turns out that we can measure that very precisely. Equations were derived, graphs were produced, and the net result is that we find that the bubble bursts faster and faster with each passing year. We are losing interest in the past more rapidly.
1950年的情况以及每一年的情况我们都记录了下来。多亏了这些漂亮的图表,我们的工作顺利多了。有了这些漂亮的图表,我们就能测量各种事物。我们会说:“泡泡破掉的速度有多快?”事实证明,我们可以非常精确地测量。推导方程,绘制图表,最终结果是泡泡破掉的速度 每年都在加快。我们对过去的遗忘不断加快。
 
JM: Now a little piece of career advice. So for those of you who seek to be famous, we can learn from the 25 most famous political figures, authors, actors and so on. So if you want to become famous early on, you should be an actor, because then fame starts rising by the end of your 20s -- you're still young, it's really great. Now if you can wait a little bit, you should be an author, because then you rise to very great heights, like Mark Twain, for instance: extremely famous.
好,现在给大家一些发展事业的建议。如果你想成名,我们可以向25位最著名的政治人物,作家,演员学习。如果你想早点成名,你就应该做个演员。因为演员在20来岁的时候成名,你还很年轻,这是本钱。如果你能等一等,那就当个作家,因为你可以像马克.吐温这样成为文坛巨星。
 
But if you want to reach the very top, you should delay gratification and, of course, become a politician. So here you will become famous by the end of your 50s, and become very, very famous afterward. So scientists also tend to get famous when they're much older. Like for instance, biologists and physicists tend to be almost as famous as actors. One mistake you should not make is to become a mathematician. (Laughter) If you do that, you might think, "Oh great. I'm going to do my best work when I'm in my 20s." But guess what, nobody will really care.
如果你想到达万人之上,你就不能安于现状,要成为一个政治家,到了快60岁的时候你就成名了,而且之后名声远扬。科学家通常在年纪一大把的时候才成名,生物学家和物理学家的名声通常能跟演员的名声媲美。有一个错误你不要犯,那就是成为一个数学家。如果你成了数学家,你会想:“太好啦,我20多岁的时候会有最辉煌的成就。”谁知道人们连睬都不睬你。
 
ELA: There are more sobering notes among the n-grams. For instance, here's the trajectory of Marc Chagall, an artist born in 1887. And this looks like the normal trajectory of a famous person. He gets more and more and more famous, except if you look in German. If you look in German, you see something completely bizarre, something you pretty much never see, which is he becomes extremely famous and then all of a sudden plummets, going through a nadir between 1933 and 1945, before rebounding afterward. And of course, what we're seeing is the fact Marc Chagall was a Jewish artist in Nazi Germany.
n字格中有些情况更为明了。这是Marc Chagall的名声起落,他是出生于1887的一位艺术家,他的名声起落看似乎没有什么异常,他的名声越来越大。然而如果你在德语书中搜索情况就不同了。在德语书中你会看到非常奇怪的现象,闻所未闻,见所未见,他先是名极一时,但突然之间名声直线下落。在1933年到1945年间达到了低谷,后来才回升,当然,实际情况是Marc Chagall是一个犹太艺术家,当时身在纳粹德国。
 
Now these signals are actually so strong that we don't need to know that someone was censored. We can actually figure it out using really basic signal processing. Here's a simple way to do it. Well, a reasonable expectation is that somebody's fame in a given period of time should be roughly the average of their fame before and their fame after. So that's sort of what we expect. And we compare that to the fame that we observe. And we just divide one by the other to produce something we call a suppression index. If the suppression index is very, very, very small, then you very well might be being suppressed. If it's very large, maybe you're benefiting from propaganda.
这些信号实在太强了,我们无需知道谁被禁了,我们事实上可以通过非常基本的信号处理来找出答案。这里有一个简单的方法,一个人在特定时期内所拥有的知名度,应当大致为他成名前与成名后知名度的平均值。这么想是有道理的,我们也是怎么想的。我们把观察到的知名度进行对比,我们把前者比上后者,产生的结果叫做抑制指数。如果抑制指数非常非常小,那么你的知名度正在被抑制,如果数值非常大,或许就表明你从宣传中获益。
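The suppression index described above can be sketched as follows. This is one reading of the talk (observed fame divided by expected fame, where expected fame is the average of the fame before and after the period), not the authors' published method. The 10-year comparison window, the function name, and the frequency values are assumptions invented for the example.

```python
from statistics import mean

def suppression_index(freq_by_year, start, end, window=10):
    """Observed fame in [start, end] divided by expected fame.

    Expected fame is the average of the mean frequency in the `window`
    years before the period and the `window` years after it (the window
    length is an assumption, not taken from the talk).
    """
    during = [f for y, f in freq_by_year.items() if start <= y <= end]
    before = [f for y, f in freq_by_year.items() if start - window <= y < start]
    after = [f for y, f in freq_by_year.items() if end < y <= end + window]
    if not (during and before and after):
        raise ValueError("need data before, during, and after the period")
    expected = (mean(before) + mean(after)) / 2
    if expected == 0:
        raise ValueError("expected fame is zero; index is undefined")
    return mean(during) / expected

# Made-up mention frequencies for a hypothetical name, not real data.
fame = {}
fame.update({year: 4.0 for year in range(1923, 1933)})  # before
fame.update({year: 0.5 for year in range(1933, 1946)})  # during
fame.update({year: 6.0 for year in range(1946, 1956)})  # after

index = suppression_index(fame, 1933, 1945)
print(round(index, 3))
```

An index well below one matches the "might be being suppressed" case in the talk, and an index well above one matches the propaganda case.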
 
JM: Now you can actually look at the distribution of suppression indexes over whole populations. So for instance, here -- this suppression index is for 5,000 people picked in English books where there's no known suppression -- it would be like this, basically tightly centered on one. What you expect is basically what you observe. This is the distribution as seen in Germany -- very different, it's shifted to the left. People were talked about half as much as they should have been.
你还可以看到抑制指数在总人数中的分布情况。这里有个例子,这是从没有明显抑制的英文书籍中选出的5000个人。它是这个样子的,基本上以1为中心,实际情况与预想差不多,而这是在德文书籍中的分布情况,与前者大为不同,往左偏了,人们被谈论的程度只有预期的一半。
 
But much more importantly, the distribution is much wider. There are many people who end up on the far left on this distribution who are talked about 10 times less than they should have been. But then also many people on the far right who seem to benefit from propaganda. This picture is the hallmark of censorship in the book record.
更重要的是这个分布的跨度更宽。不少人处于左边的部分,人数比预期中少了10倍。而也有不少人处于更靠右的部分,他们的宣传起了作用,这幅图反映了书籍记录中的审查情况。
 
ELA: So culturomics is what we call this method. It's kind of like genomics. Except genomics is a lens on biology through the window of the sequence of bases in the human genome. Culturomics is similar. It's the application of massive-scale data collection analysis to the study of human culture. Here, instead of through the lens of a genome, through the lens of digitized pieces of the historical record. The great thing about culturomics is that everyone can do it.
我们把这种方法称作文化组学。有点像基因组学。只不过基因组学是生物学上观察人类基因组序列的透镜,文化组学很类似,它指的是对人类文明研究的大规模数据收集分析的应用。它使用的不是基因组这个透镜,而是用数字化的历史记录片段作为透镜。文化组学的优点是人人都会用它。
 
Why can everyone do it? Everyone can do it because three guys, Jon Orwant, Matt Gray and Will Brockman over at Google, saw the prototype of the Ngram Viewer, and they said, "This is so fun. We have to make this available for people." So in two weeks flat -- the two weeks before our paper came out -- they coded up a version of the Ngram Viewer for the general public. And so you too can type in any word or phrase that you're interested in and see its n-gram immediately -- also browse examples of all the various books in which your n-gram appears.
为什么呢?这是因为这三个人,谷歌的乔恩.奥温特,迈特.格雷和威尔.布洛克曼,看到了n字格后说:“这太有意思了,我们得让所有人都用上它。”于是在我们的论文发表之前的整整两个星期中,他们编了一个面向公众的Ngram Viewer版本。现在你们也可以输入任何你感兴趣的单词或词组,查看它的n字格,并阅览所有书籍中出现n字格的例句。
 
JM: Now this was used over a million times on the first day, and this is really the best of all the queries. So people want to be their best, put their best foot forward. But it turns out in the 18th century, people didn't really care about that at all. They didn't want to be their best, they wanted to be their beft. So what happened is, of course, this is just a mistake. It's not that people strove for mediocrity, it's just that the S used to be written differently, kind of like an F. Now of course, Google didn't pick this up at the time, so we reported this in the science article that we wrote. But it turns out this is just a reminder that, although this is a lot of fun, when you interpret these graphs, you have to be very careful, and you have to adopt the base standards in the sciences.
这个词在第一天就被使用了超过一百万次。这真的是最棒的一个搜索词,人们总想做到最好。总想展示最好的一面。但是在18世纪人们对此并不在乎,他们不想做到最好(“best”)而是“beft”。实际上这是个错别字,这并不是因为人们不识字,而是因为当时英文字母S的写法跟现在不同,看起来像F。当然谷歌没有意识到这一点,于是我们对此在论文中做了报告。这实际上只是一个小提示,尽管这很有趣,但是你在解读这些图表时仍须非常谨慎,你必须遵循基本的科学准则。
 
ELA: People have been using this for all kinds of fun purposes. (Laughter) Actually, we're not going to have to talk, we're just going to show you all the slides and remain silent. This person was interested in the history of frustration. There's various types of frustration. If you stub your toe, that's a one A "argh." If the planet Earth is annihilated by the Vogons to make room for an interstellar bypass, that's an eight A "aaaaaaaargh." This person studies all the "arghs," from one through eight A's. And it turns out that the less-frequent "arghs" are, of course, the ones that correspond to things that are more frustrating -- except, oddly, in the early 80s. We think that might have something to do with Reagan.
人们使用它来寻求各种乐趣。我们不打算多说,光给你们看这些幻灯片,这个用户对人们烦躁的历史很感兴趣。这里有不同类型的烦躁,如果你的脚趾被碰了,你会说“argh”,如果地球被外星人毁灭了,开了一条星际航道,那就是"aaaaaaaargh"。这个人研究了不同长短的“啊”,从1个啊到8个啊,结果,那些使用频率较低的啊,代表程度更高的烦躁,八十年代是个例外。我们猜这可能跟里根总统有关。
 
JM: There are many usages of this data, but the bottom line is that the historical record is being digitized. Google has started to digitize 15 million books. That's 12 percent of all the books that have ever been published. It's a sizable chunk of human culture. There's much more in culture: there's manuscripts, there's newspapers, there's things that are not text, like art and paintings. These will all happen to be on our computers, on computers across the world. And when that happens, that will transform the way we have to understand our past, our present and human culture.
这个数据库的用处很多,但最重要的是这是一个数字化的历史记录。谷歌已经开始对1500万本书进行数字化处理,这相当于有史以来出版的所有书籍的12%。这是人类文明相当大的一部分,而文明还包括更多的内容,有手稿,报纸,非文字的内容,例如艺术与绘画。这些内容都会出现在我们的电脑上,在世界各地的电脑上。如果这成真了,我们对过去现在以及人类文明的认识就被改变了。
 
Thank you very much.
非常感谢大家。

Source: TED Talks
