Library of Congress acquires entire Twitter archive

Yep, it’s true. See if you can wrap your head around this. The great institution of All Things Worth Saving will now be saving for all eternity the archive of All Things Not Meant To Be Saved. The announcement is titled “How Tweet It Is!: Library Acquires Entire Twitter Archive.” Says the LOC:

Have you ever sent out a “tweet” on the popular Twitter social media service?  Congratulations: Your 140 characters or less will now be housed in the Library of Congress.

That’s right.  Every public tweet, ever, since Twitter’s inception in March 2006, will be archived digitally at the Library of Congress. That’s a LOT of tweets, by the way: Twitter processes more than 50 million tweets every day, with the total numbering in the billions.

They go on to list some noteworthy tweets that may be worth remembering in ten thousand years and beyond:

Just a few examples of important tweets in the past few years include the first-ever tweet from Twitter co-founder Jack Dorsey (http://twitter.com/jack/status/20), President Obama’s tweet about winning the 2008 election (http://twitter.com/barackobama/status/992176676), and a set of two tweets from a photojournalist who was arrested in Egypt and then freed because of a series of events set into motion by his use of Twitter (http://twitter.com/jamesbuck/status/786571964) and (http://twitter.com/jamesbuck/status/787167620).

At the current rate of 50 million tweets per day, that’s 18,250,000,000 tweets per year, or 3,832,500,000,000 tweets every 210 years, the amount of time since the Library of Congress was founded in 1800. Of course, once everybody on the planet is tweeting hundreds of times per day, along with their household pets, appliances, and spambots, there could be 50 billion tweets per day. So attention LOC librarians: time to sharpen those pencils and roll up your sleeves — you’re about to get real busy chasing stray tweets.
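For anyone who wants to check the back-of-the-envelope math, here is a minimal Python sketch. The variable names are mine, and it assumes a flat 50 million tweets per day, 365-day years (no leap days), and 2010 as the current year, which is what the figures above imply.

```python
# Back-of-the-envelope tweet arithmetic.
# Assumptions: constant daily rate, 365-day years, current year 2010.
TWEETS_PER_DAY = 50_000_000
DAYS_PER_YEAR = 365
YEARS_SINCE_LOC_FOUNDED = 2010 - 1800  # 210 years

tweets_per_year = TWEETS_PER_DAY * DAYS_PER_YEAR
tweets_per_210_years = tweets_per_year * YEARS_SINCE_LOC_FOUNDED

print(f"{tweets_per_year:,} tweets per year")
print(f"{tweets_per_210_years:,} tweets every {YEARS_SINCE_LOC_FOUNDED} years")
```

Swap in 50 billion for `TWEETS_PER_DAY` to see what the pets-and-appliances scenario would do to the totals.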


Mike Siwek and Audrey Fino demonstrate Google’s semantic genius

Wired has an interesting article in its March, 2010 issue, How Google’s Algorithm Rules the Web, that illuminates some of the reasons why Google’s search algorithm works so well and sets Google apart from other search engines.

“The algorithm is extremely important in search, but it’s not the only thing,” says Brian MacDonald, Microsoft’s VP of core search. “You buy a car for reasons beyond just the engine.”

Google’s response can be summed up in four words: mike siwek lawyer mi.

Amit Singhal types that koan into his company’s search box. Singhal, a gentle man in his forties, is a Google Fellow, an honorific bestowed upon him four years ago to reward his rewrite of the search engine in 2001. He jabs the Enter key. In a time span best measured in a hummingbird’s wing-flaps, a page of links appears. The top result connects to a listing for an attorney named Michael Siwek in Grand Rapids, Michigan. It’s a fairly innocuous search — the kind that Google’s servers handle billions of times a day — but it is deceptively complicated. Type those same words into Bing, for instance, and the first result is a page about the NFL draft that includes safety Lawyer Milloy. Several pages into the results, there’s no direct referral to Siwek.

The comparison demonstrates the power, even intelligence, of Google’s algorithm, honed over countless iterations. It possesses the seemingly magical ability to interpret searchers’ requests — no matter how awkward or misspelled. Google refers to that ability as search quality, and for years the company has closely guarded the process by which it delivers such accurate results….

Of course, just by being written about and linked to from Wired, Lawyer Siwek’s search results have all been skewed. But no matter — that’s to be expected. What interests me is how Google is essentially playing with language, like we do here at Wordlab, in a continuous effort to break it up and recombine it to figure out contextual relationships and semantic meaning.

…Google has used its huge mass of collected data to bolster its algorithm with an amazingly deep knowledge base that helps interpret the complex intent of cryptic queries.

Take, for instance, the way Google’s engine learns which words are synonyms. “We discovered a nifty thing very early on,” Singhal says. “People change words in their queries. So someone would say, ‘pictures of dogs,’ and then they’d say, ‘pictures of puppies.’ So that told us that maybe ‘dogs’ and ‘puppies’ were interchangeable. We also learned that when you boil water, it’s hot water. We were relearning semantics from humans, and that was a great advance.”

But there were obstacles. Google’s synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein’s theories about how words are defined by context. As Google crawled and archived billions of documents and Web pages, it analyzed what words were close to each other. “Hot dog” would be found in searches that also contained “bread” and “mustard” and “baseball games” — not poached pooches. That helped the algorithm understand what “hot dog” — and millions of other terms — meant. “Today, if you type ‘Gandhi bio,’ we know that bio means biography,” Singhal says. “And if you type ‘bio warfare,’ it means biological.”
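A quick aside from the quoted article: the "words are defined by context" idea can be played with in a few lines of Python. This is only a toy sketch, not Google's method. The sample documents, the `context_counts` helper, and the four-word window are all invented here for illustration. It simply tallies which words show up near a phrase, so that "hot dog" ends up keeping company with "mustard" rather than with "puppy."

```python
from collections import Counter

# Toy "documents" standing in for crawled pages (invented for illustration).
DOCS = [
    "hot dog with mustard on bread at the baseball game",
    "grilled hot dog bun mustard relish",
    "my dog is a puppy that loves the park",
    "puppy training tips for a new dog owner",
    "bio of gandhi a short biography",
    "bio warfare and biological weapons treaty",
]

def context_counts(phrase, docs, window=4):
    """Count words appearing within `window` words of each occurrence of `phrase`."""
    target = phrase.split()
    n = len(target)
    counts = Counter()
    for doc in docs:
        words = doc.split()
        for i in range(len(words) - n + 1):
            if words[i:i + n] == target:
                left = words[max(0, i - window):i]
                right = words[i + n:i + n + window]
                counts.update(left + right)
    return counts

hot_dog = context_counts("hot dog", DOCS)
puppy = context_counts("puppy", DOCS)

print(hot_dog.most_common(3))
print(puppy.most_common(3))
```

With real data the contexts would be far noisier, and some weighting scheme would be needed so that filler words like "the" and "a" do not dominate the counts.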

…for the most part, the improvement process is a relentless slog, grinding through bad results to determine what isn’t working. One unsuccessful search became a legend: Sometime in 2001, Singhal learned of poor results when people typed the name “audrey fino” into the search box. Google kept returning Italian sites praising Audrey Hepburn. (Fino means fine in Italian.) “We realized that this is actually a person’s name,” Singhal says. “But we didn’t have the smarts in the system.”

The Audrey Fino failure led Singhal on a multiyear quest to improve the way the system deals with names — which account for 8 percent of all searches. To crack it, he had to master the black art of “bi-gram breakage” — that is, separating multiple words into discrete units. For instance, “new york” represents two words that go together (a bi-gram). But so would the three words in “new york times,” which clearly indicate a different kind of search. And everything changes when the query is “new york times square.” Humans can make these distinctions instantly, but Google does not have a Brazil-like back room with hundreds of thousands of cubicle jockeys. It relies on algorithms.
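Another aside, for the curious: here is one way the "new york" / "new york times" / "new york times square" puzzle could be sketched in Python. To be clear, this is not how Google does it. The phrase scores below are made up, and `segment`, `PHRASE_SCORES`, `SINGLE_WORD_SCORE`, and `MAX_PHRASE_LEN` are hypothetical names of my own. The sketch tries every way of splitting the query into known units and keeps the split with the highest total score, which is why "new york times square" breaks differently from "new york times."

```python
from functools import lru_cache

# Hypothetical phrase scores, invented for this example. A real system would
# presumably derive something like these from query logs and crawled text.
PHRASE_SCORES = {
    ("new", "york"): 5.0,
    ("new", "york", "times"): 8.0,
    ("times", "square"): 6.0,
}
SINGLE_WORD_SCORE = 1.0  # fallback score for any word standing alone
MAX_PHRASE_LEN = 3       # longest unit (in words) that we try

def segment(query):
    """Split a query into the highest-scoring sequence of units."""
    words = tuple(query.lower().split())

    @lru_cache(maxsize=None)
    def best(start):
        # Returns (total score, tuple of units) for words[start:].
        if start == len(words):
            return 0.0, ()
        options = []
        last_end = min(start + MAX_PHRASE_LEN, len(words))
        for end in range(start + 1, last_end + 1):
            unit = words[start:end]
            if len(unit) == 1:
                score = SINGLE_WORD_SCORE
            elif unit in PHRASE_SCORES:
                score = PHRASE_SCORES[unit]
            else:
                continue  # unknown multi-word unit: not a candidate
            rest_score, rest_units = best(end)
            options.append((score + rest_score, (" ".join(unit),) + rest_units))
        return max(options)

    return list(best(0)[1])

for q in ["new york", "new york times", "new york times square"]:
    print(q, "->", segment(q))
```

A simple greedy longest-match would grab "new york times" first and leave "square" stranded. Scoring whole segmentations is one way around that, though everything hinges on where the scores come from.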

The Mike Siwek query illustrates how Google accomplishes this. When Singhal types in a command to expose a layer of code underneath each search result, it’s clear which signals determine the selection of the top links: a bi-gram connection to figure it’s a name; a synonym; a geographic location. “Deconstruct this query from an engineer’s point of view,” Singhal explains. “We say, ‘Aha! We can break this here!’ We figure that lawyer is not a last name and Siwek is not a middle name. And by the way, lawyer is not a town in Michigan. A lawyer is an attorney.”

This is the hard-won realization from inside the Google search engine, culled from the data generated by billions of searches: a rock is a rock. It’s also a stone, and it could be a boulder. Spell it “rokc” and it’s still a rock. But put “little” in front of it and it’s the capital of Arkansas. Which is not an ark. Unless Noah is around. “The holy grail of search is to understand what the user wants,” Singhal says. “Then you are not matching words; you are actually trying to match meaning.”

Thanks to the Interweb, Mike Siwek and Audrey Fino (and, from the article, Garden Grove psychologist Cindy Louise Greenslade) are now cyberlebrities, endlessly spidered, SERPed, and iterated through the dreamtime that is not real, yet we think about it. Like conceptual art.


Virus update: one million English words, and counting

The English language received its official unofficial one millionth word this morning at 5:22 a.m. ET. And, just in time for the coming Web 3.0 phenomenon, the one millionth word is…wait for it…

Web 2.0.

Of course, “Web 2.0” being crowned the One Millionth English Word, and having the coronation at exactly 5:22 this morning, is just an estimate, made by a website called the Global Language Monitor, “a Web site that uses a math formula to estimate how often words are created.” I like that: words used to describe a math formula used to estimate how many words there are that could be put to use to describe math formulas that estimate…well, you get the picture.

According to the article today on CNN.com:

[Global Language Monitor] estimates the millionth English word, “Web 2.0,” was added to the language Wednesday at 5:22 a.m. ET. The term refers to the second, more social generation of the Internet.

The site says more than 14 words are added to English every day, at the current rate.

The “Million Word March,” however, has made the man who runs this word-counting project somewhat of a pariah in the linguistic community. Some linguists say it’s impossible to count the number of words in a language because languages are always changing, and because defining what counts as a word is a fruitless endeavor.

Paul J.J. Payack, president and chief word analyst for the Global Language Monitor, says, however, that the million-word estimation isn’t as important as the idea behind his project, which is to show that English has become a complex, global language.

“It’s a people’s language,” he said.

Other languages, like French, Payack said, put big walls around their vocabularies. English brings others in.

“English has the tradition of swallowing new words whole,” he said. “Other languages translate.”

Certainly that’s what Wordlab has always been about: swallowing new words whole…and then regurgitating them in new combinations.

Still, Payack says he doesn’t include all new words in his count. Words must make sense in at least 60 percent of the world to be official, he said. And they must make sense to different communities of people. A new technology term that’s only understood in Silicon Valley wouldn’t count as a mainstream word, he said.

His computer models check a total of 5,000 dictionaries, scholarly publications and news articles, as well as billions of Web sites, to see how frequently words are used, he said. A word must make 25,000 appearances to be deemed legitimate.
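The article doesn't spell out Payack's actual formula, so the following is only a guess at the simplest possible version of the "25,000 appearances" rule: count how often each candidate term turns up in a pile of text, then keep the ones that clear the bar. The helper names (`count_terms`, `legitimate`) and the three-sentence corpus are invented for illustration, and the demo uses a tiny threshold because the corpus is tiny.

```python
import re
from collections import Counter

MIN_APPEARANCES = 25_000  # the bar quoted in the CNN article

def count_terms(candidates, texts):
    """Count case-insensitive, whole-term appearances of each candidate."""
    counts = Counter()
    for term in candidates:
        # Lookarounds keep "defriend" from matching inside "defriended".
        pattern = re.compile(
            r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE
        )
        counts[term] = sum(len(pattern.findall(text)) for text in texts)
    return counts

def legitimate(counts, threshold=MIN_APPEARANCES):
    """Return the terms whose tally meets the threshold, sorted alphabetically."""
    return sorted(term for term, n in counts.items() if n >= threshold)

# Tiny invented corpus standing in for dictionaries, articles, and Web sites.
texts = [
    "Web 2.0 is here. Everyone loves Web 2.0!",
    "Please defriend me before the wardrobe malfunction.",
    "Is web 2.0 over yet?",
]
counts = count_terms(["Web 2.0", "defriend", "recessionista"], texts)
print(legitimate(counts, threshold=2))
```

Note that this sketch ignores the other criterion mentioned above (making sense in at least 60 percent of the world), which would need some notion of where each text came from.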

Payack said news events have also fueled the rapid expansion of English, which he said has more words than any other language. Mandarin Chinese comes in second with about 450,000 words, he said.

English terms like “Obamamania,” “defriend,” “wardrobe malfunction,” “zombie banks,” “shovel ready” and “recessionista” all have grown out of recent news cycles about the presidential election, economic crash, online networking or a sports event, he said. Other languages might not have developed new terms to deal with such phenomena, he said.

That’s the true beauty and power of English, and its new global function: serving as a language laboratory for the entire world. An interesting corollary question: how many English words die out every day, week, or month? None of these new words get carved in stone, and even the Oxford English Dictionary is filled with many archaic words no longer in use.

Language experts who spoke with CNN said they disapprove of Payack’s count, but they agree that English generally has more words than most, if not all, languages.

“This is stuff that you just can’t count,” said Jesse Sheidlower, editor at large of the Oxford English Dictionary. “No one can count it, and to pretend that you can is totally disingenuous. It simply can’t be done.”

The Oxford English Dictionary has about 600,000 entries, Sheidlower said. But that by no means includes all words, he said.

… Part of what makes determining the number of words in a language so difficult is that there are so many root words and their variants, said Sarah Thomason, president of the Linguistic Society of America and a linguistics professor at the University of Michigan.

… Linguists and lexicographers run into further complications when trying to count words that are spelled one way but can have several meanings, said Allan Metcalf, an English professor at MacMurray College in Illinois, and an officer at the American Dialect Society.

“The word bear, b-e-a-r — is that two words or one, for example? You have a noun that’s a wild creature and then you have b-e-a-r, [which means] to bear left or to bear right, and there’s many other things,” he said. “So you really can’t be exact about a millionth word.”

Can any of these linguists or word-counters bear to get into pun territory? Absolutely: each meaning of “bear,” and of every other word, should count as a separate word. Again, multiple meanings, puns, and homonyms are all part of what gives the English language so much flavor and customizability (not a word, BTW, according to the OED). Call it Language 2.0 if you must (but really, please don’t; I’m just planting a virus here).

[Payack] said the count is meant to be a celebration of English as a global language. And, while he says other languages are being stamped out by English’s expansion, it’s a powerful thing that so many people today are able to communicate with such a vast list of words.

Hear, hear, brother. As William S. Burroughs famously said, “Language is a virus.” And English, with its metastasizing foam of wordbirth and worddeath, is the smallpox of languages.


From Milan to Microsoft Surface

There’s a fundamental change occurring in the world of technology. Code-named Milan, a new computer has surfaced at Microsoft. Surface.

“Pretty exciting, eh?” Gates said with a sly smile, when he put his hand down on what looked initially like a low, black coffee table: At the touch of his hand, the hard, plastic tabletop suddenly dissolved into what looked like tiny ripples of water. The ‘water’ responded to each of his fingers and the ripples rushed quickly away in every direction.

“Go ahead,” he said. “Try it.” When I placed my hand on the table at the same time, there were more ripples.

It took a moment to appreciate what was happening. Every hand motion Gates or I did was met with an immediate response from the table. There was no keyboard. There was no mouse. Just our gestures.

“All you have to do is reach out and touch the Surface,” Gates told me with barely concealed pride. “And it responds to what you do.”

In an industry whose bold pronouncements about the future have taught me the benefits of skepticism, Surface took my breath away. If the Surface project rollout goes as planned in November, it could alter the way everyday Americans control the technology that currently overwhelms many of us.

But does it come with Pong?
