Wired has an interesting article in its March 2010 issue, “How Google’s Algorithm Rules the Web,” that illuminates some of the reasons Google’s search algorithm works so well and what sets Google apart from other search engines.
“The algorithm is extremely important in search, but it’s not the only thing,” says Brian MacDonald, Microsoft’s VP of core search. “You buy a car for reasons beyond just the engine.”
Google’s response can be summed up in four words: mike siwek lawyer mi.
Amit Singhal types that koan into his company’s search box. Singhal, a gentle man in his forties, is a Google Fellow, an honorific bestowed upon him four years ago to reward his rewrite of the search engine in 2001. He jabs the Enter key. In a time span best measured in a hummingbird’s wing-flaps, a page of links appears. The top result connects to a listing for an attorney named Michael Siwek in Grand Rapids, Michigan. It’s a fairly innocuous search — the kind that Google’s servers handle billions of times a day — but it is deceptively complicated. Type those same words into Bing, for instance, and the first result is a page about the NFL draft that includes safety Lawyer Milloy. Several pages into the results, there’s no direct referral to Siwek.
The comparison demonstrates the power, even intelligence, of Google’s algorithm, honed over countless iterations. It possesses the seemingly magical ability to interpret searchers’ requests — no matter how awkward or misspelled. Google refers to that ability as search quality, and for years the company has closely guarded the process by which it delivers such accurate results….
Of course, now that Wired has written about him and linked to him, Lawyer Siwek’s search results have all been skewed. But no matter; that’s to be expected. What interests me is how Google is essentially playing with language, as we do here at Wordlab, continually breaking it up and recombining it to figure out contextual relationships and semantic meaning.
…Google has used its huge mass of collected data to bolster its algorithm with an amazingly deep knowledge base that helps interpret the complex intent of cryptic queries.
Take, for instance, the way Google’s engine learns which words are synonyms. “We discovered a nifty thing very early on,” Singhal says. “People change words in their queries. So someone would say, ‘pictures of dogs,’ and then they’d say, ‘pictures of puppies.’ So that told us that maybe ‘dogs’ and ‘puppies’ were interchangeable. We also learned that when you boil water, it’s hot water. We were relearning semantics from humans, and that was a great advance.”
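To make the query-rewriting idea concrete, here is a toy sketch in Python. It is not Google's method, just one simple way to mine the signal Singhal describes: look at consecutive queries in a session and count the word pairs that get swapped when everything else stays the same. The function name `substitution_counts` and the sample sessions are invented for illustration.

```python
from collections import Counter

def substitution_counts(sessions):
    """Count word swaps between consecutive queries that differ in exactly one word.

    `sessions` is a list of query lists, one list per (hypothetical) user session.
    """
    counts = Counter()
    for queries in sessions:
        for before, after in zip(queries, queries[1:]):
            a, b = before.lower().split(), after.lower().split()
            if len(a) != len(b):
                continue  # only compare rewrites of the same length
            diffs = [(x, y) for x, y in zip(a, b) if x != y]
            if len(diffs) == 1:
                # Sort the pair so dogs->puppies and puppies->dogs count together.
                counts[tuple(sorted(diffs[0]))] += 1
    return counts

# Invented sample data, echoing the examples in the quote above.
sessions = [
    ["pictures of dogs", "pictures of puppies"],
    ["cute puppies", "cute dogs"],
    ["boiling water", "hot water"],
]
swap_counts = substitution_counts(sessions)
for pair, n in swap_counts.most_common():
    print(pair, n)
```

Pairs that are swapped often across many sessions become synonym candidates. As the next passage shows, candidates gathered this naively can be combined in ways that go badly wrong.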
But there were obstacles. Google’s synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein’s theories about how words are defined by context. As Google crawled and archived billions of documents and Web pages, it analyzed what words were close to each other. “Hot dog” would be found in searches that also contained “bread” and “mustard” and “baseball games” — not poached pooches. That helped the algorithm understand what “hot dog” — and millions of other terms — meant. “Today, if you type ‘Gandhi bio,’ we know that bio means biography,” Singhal says. “And if you type ‘bio warfare,’ it means biological.”
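The "Gandhi bio" versus "bio warfare" example can be sketched as a lookup against context words. This is a deliberately tiny illustration, assuming hand-written context sets; a real system would learn them from which words appear near each other across billions of documents. The names `SENSE_CONTEXTS` and `expand_bio` are hypothetical.

```python
# Hand-made context words for two senses of "bio" (an assumption for this
# sketch; the article says Google derives such context from crawled documents).
SENSE_CONTEXTS = {
    "biography": {"gandhi", "life", "author", "born"},
    "biological": {"warfare", "weapon", "hazard", "chemical"},
}

def expand_bio(query):
    """Pick the sense of 'bio' whose context words overlap most with the query."""
    words = set(query.lower().split())
    scores = {sense: len(words & context) for sense, context in SENSE_CONTEXTS.items()}
    best = max(scores, key=scores.get)
    # With no overlapping context word at all, decline to guess.
    return best if scores[best] > 0 else None

print(expand_bio("Gandhi bio"))
print(expand_bio("bio warfare"))
```

The same trick is what keeps "hot dog" next to "mustard" and away from boiling puppies: the neighbors of a word decide which meaning applies.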
…for the most part, the improvement process is a relentless slog, grinding through bad results to determine what isn’t working. One unsuccessful search became a legend: Sometime in 2001, Singhal learned of poor results when people typed the name “audrey fino” into the search box. Google kept returning Italian sites praising Audrey Hepburn. (Fino means fine in Italian.) “We realized that this is actually a person’s name,” Singhal says. “But we didn’t have the smarts in the system.”
The Audrey Fino failure led Singhal on a multiyear quest to improve the way the system deals with names — which account for 8 percent of all searches. To crack it, he had to master the black art of “bi-gram breakage” — that is, separating multiple words into discrete units. For instance, “new york” represents two words that go together (a bi-gram). But so would the three words in “new york times,” which clearly indicate a different kind of search. And everything changes when the query is “new york times square.” Humans can make these distinctions instantly, but Google does not have a Brazil-like back room with hundreds of thousands of cubicle jockeys. It relies on algorithms.
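The "new york times square" problem can be illustrated with a small segmentation sketch. This is not how Google does it; it is a generic dynamic-programming approach that tries every way of splitting the query and keeps the split whose known phrases score highest. The phrase scores in `PHRASE_SCORES` are made up for the example.

```python
# Made-up scores for how strongly each phrase hangs together.
PHRASE_SCORES = {
    ("new", "york"): 5,
    ("new", "york", "times"): 8,
    ("times", "square"): 6,
}

def segment(query, max_len=3):
    """Split a query into the highest-scoring sequence of words and known phrases."""
    words = query.lower().split()
    n = len(words)
    # best[i] holds (score, segments) for the first i words.
    best = [(0, [])] + [None] * n
    for end in range(1, n + 1):
        candidates = []
        for start in range(max(0, end - max_len), end):
            chunk = tuple(words[start:end])
            if len(chunk) > 1 and chunk not in PHRASE_SCORES:
                continue  # unknown multi-word chunk: not a valid unit
            score = PHRASE_SCORES.get(chunk, 0)  # single words score 0
            prev_score, prev_segments = best[start]
            candidates.append((prev_score + score, prev_segments + [" ".join(chunk)]))
        best[end] = max(candidates, key=lambda c: c[0])
    return best[n][1]

print(segment("new york"))
print(segment("new york times"))
print(segment("new york times square"))
```

With these scores, adding "square" changes where the break falls: "new york times" stays whole on its own, but the four-word query splits into "new york" and "times square", because that pairing scores higher than "new york times" plus a stray "square".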
The Mike Siwek query illustrates how Google accomplishes this. When Singhal types in a command to expose a layer of code underneath each search result, it’s clear which signals determine the selection of the top links: a bi-gram connection to figure it’s a name; a synonym; a geographic location. “Deconstruct this query from an engineer’s point of view,” Singhal explains. “We say, ‘Aha! We can break this here!’ We figure that lawyer is not a last name and Siwek is not a middle name. And by the way, lawyer is not a town in Michigan. A lawyer is an attorney.”
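As a rough picture of what "deconstructing" that query might look like, here is a toy sketch. The lookup tables are invented and contain only what this one query needs; the article does not describe Google's actual signals in this detail. Anything that is neither a known topic word nor a state abbreviation is assumed to be part of a name.

```python
# Tiny hand-made lookup tables (assumptions for this sketch only).
TOPIC_SYNONYMS = {"lawyer": ["attorney"]}
STATE_ABBREVIATIONS = {"mi": "michigan"}

def deconstruct(query):
    """Label each token of a query as part of a name, a topic, or a place."""
    parsed = {"name": [], "topic": [], "place": None}
    for token in query.lower().split():
        if token in TOPIC_SYNONYMS:
            # Keep the original word and add its synonyms.
            parsed["topic"] = [token] + TOPIC_SYNONYMS[token]
        elif token in STATE_ABBREVIATIONS:
            parsed["place"] = STATE_ABBREVIATIONS[token]
        else:
            # Fallback assumption: leftover tokens belong to a person's name.
            parsed["name"].append(token)
    return parsed

print(deconstruct("mike siwek lawyer mi"))
```

For the Siwek query this yields a two-word name, a topic that matches either "lawyer" or "attorney", and Michigan as the place, which mirrors the three signals Singhal points to: a bi-gram name, a synonym, and a geographic location.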
This is the hard-won realization from inside the Google search engine, culled from the data generated by billions of searches: a rock is a rock. It’s also a stone, and it could be a boulder. Spell it “rokc” and it’s still a rock. But put “little” in front of it and it’s the capital of Arkansas. Which is not an ark. Unless Noah is around. “The holy grail of search is to understand what the user wants,” Singhal says. “Then you are not matching words; you are actually trying to match meaning.”
Thanks to the Interweb, Mike Siwek and Audrey Fino (and, from the article, Garden Grove psychologist Cindy Louise Greenslade) are now cyberlebrities, endlessly spidered, SERPed, and iterated through the dreamtime that is not real, yet we think about it. Like conceptual art.