Mike Siwek and Audrey Fino demonstrate Google’s semantic genius

Wired has an interesting article in its March 2010 issue, “How Google’s Algorithm Rules the Web,” that illuminates some of the reasons why Google’s search algorithm works so well and sets Google apart from other search engines.

“The algorithm is extremely important in search, but it’s not the only thing,” says Brian MacDonald, Microsoft’s VP of core search. “You buy a car for reasons beyond just the engine.”

Google’s response can be summed up in four words: mike siwek lawyer mi.

Amit Singhal types that koan into his company’s search box. Singhal, a gentle man in his forties, is a Google Fellow, an honorific bestowed upon him four years ago to reward his rewrite of the search engine in 2001. He jabs the Enter key. In a time span best measured in a hummingbird’s wing-flaps, a page of links appears. The top result connects to a listing for an attorney named Michael Siwek in Grand Rapids, Michigan. It’s a fairly innocuous search — the kind that Google’s servers handle billions of times a day — but it is deceptively complicated. Type those same words into Bing, for instance, and the first result is a page about the NFL draft that includes safety Lawyer Milloy. Several pages into the results, there’s no direct referral to Siwek.

The comparison demonstrates the power, even intelligence, of Google’s algorithm, honed over countless iterations. It possesses the seemingly magical ability to interpret searchers’ requests — no matter how awkward or misspelled. Google refers to that ability as search quality, and for years the company has closely guarded the process by which it delivers such accurate results….

Of course, just by being written about and linked to from Wired, Lawyer Siwek’s search results have all been skewed. But no matter — that’s to be expected. What interests me is how Google is essentially playing with language, like we do here at Wordlab, in a continuous effort to break it up and recombine it to figure out contextual relationships and semantic meaning.

…Google has used its huge mass of collected data to bolster its algorithm with an amazingly deep knowledge base that helps interpret the complex intent of cryptic queries.

Take, for instance, the way Google’s engine learns which words are synonyms. “We discovered a nifty thing very early on,” Singhal says. “People change words in their queries. So someone would say, ‘pictures of dogs,’ and then they’d say, ‘pictures of puppies.’ So that told us that maybe ‘dogs’ and ‘puppies’ were interchangeable. We also learned that when you boil water, it’s hot water. We were relearning semantics from humans, and that was a great advance.”
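For the curious, here is a toy sketch in Python of the trick Singhal describes: watching which single word people swap between one query and the next. This is not Google’s code. The function name `substitution_pairs` and the sample sessions are invented for illustration, and a real system would need far more data and filtering.

```python
from collections import Counter

def substitution_pairs(sessions):
    """Count word swaps between consecutive queries that differ in exactly one word."""
    counts = Counter()
    for session in sessions:
        # Compare each query with the one the same user typed next.
        for before, after in zip(session, session[1:]):
            a, b = before.split(), after.split()
            if len(a) != len(b):
                continue  # only handle same-length rewrites in this sketch
            diffs = [(x, y) for x, y in zip(a, b) if x != y]
            if len(diffs) == 1:
                # Store the pair without regard to direction of the swap.
                counts[frozenset(diffs[0])] += 1
    return counts

# Hypothetical search sessions, made up for this example.
sessions = [
    ["pictures of dogs", "pictures of puppies"],
    ["pictures of puppies", "pictures of dogs"],
    ["boiling water", "hot water"],
]
counts = substitution_pairs(sessions)
print(counts[frozenset(("dogs", "puppies"))])  # 2
print(counts[frozenset(("boiling", "hot"))])   # 1
```

Pairs that are swapped often become synonym candidates. Nothing in this sketch knows about context, so it cannot tell a real synonym from a swap that only works in some phrases.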

But there were obstacles. Google’s synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein’s theories about how words are defined by context. As Google crawled and archived billions of documents and Web pages, it analyzed what words were close to each other. “Hot dog” would be found in searches that also contained “bread” and “mustard” and “baseball games” — not poached pooches. That helped the algorithm understand what “hot dog” — and millions of other terms — meant. “Today, if you type ‘Gandhi bio,’ we know that bio means biography,” Singhal says. “And if you type ‘bio warfare,’ it means biological.”
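The context idea can be sketched in a few lines of Python as well. The version below is an assumption-laden toy, not the method Wired describes in detail: the function `expand`, the candidate list, and the four tiny “documents” are all made up, and co-occurrence is counted per document rather than by word distance.

```python
def expand(candidates, context_words, docs):
    """Pick the candidate expansion that shares the most documents with the context words."""
    scores = {}
    for cand in candidates:
        score = 0
        for doc in docs:
            tokens = set(doc.split())
            if cand in tokens:
                # One point for every context word found in the same document.
                score += sum(1 for w in context_words if w in tokens)
        scores[cand] = score
    return max(scores, key=scores.get)

# Hypothetical mini-corpus standing in for billions of crawled pages.
docs = [
    "gandhi biography early life and political career",
    "a short biography of gandhi",
    "biological warfare agents and treaties",
    "history of biological weapons in warfare",
]
senses_of_bio = ["biography", "biological"]

print(expand(senses_of_bio, ["gandhi"], docs))   # biography
print(expand(senses_of_bio, ["warfare"], docs))  # biological
```

The same word gets a different reading depending on its neighbors, which is the whole point of the Wittgenstein-flavored fix.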

…for the most part, the improvement process is a relentless slog, grinding through bad results to determine what isn’t working. One unsuccessful search became a legend: Sometime in 2001, Singhal learned of poor results when people typed the name “audrey fino” into the search box. Google kept returning Italian sites praising Audrey Hepburn. (Fino means fine in Italian.) “We realized that this is actually a person’s name,” Singhal says. “But we didn’t have the smarts in the system.”

The Audrey Fino failure led Singhal on a multiyear quest to improve the way the system deals with names — which account for 8 percent of all searches. To crack it, he had to master the black art of “bi-gram breakage” — that is, separating multiple words into discrete units. For instance, “new york” represents two words that go together (a bi-gram). But so would the three words in “new york times,” which clearly indicate a different kind of search. And everything changes when the query is “new york times square.” Humans can make these distinctions instantly, but Google does not have a Brazil-like back room with hundreds of thousands of cubicle jockeys. It relies on algorithms.
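The “new york times square” puzzle makes a nice programming exercise. The Python sketch below assumes a small table of known phrases with invented scores (`PHRASE_SCORES` is hypothetical, as is the function `segment`) and uses dynamic programming to pick the split with the highest total. It is one plausible way to break a query into units, not a description of what Google actually runs.

```python
# Hypothetical phrase scores; higher means the words belong together more strongly.
PHRASE_SCORES = {
    "new york": 5.0,
    "new york times": 8.0,
    "times square": 6.0,
}

def segment(query, phrase_scores, max_len=3):
    """Split a query into known phrases and single words, maximizing total phrase score."""
    words = query.split()
    n = len(words)
    # best[i] holds (score, segments) for the first i words.
    best = [(0.0, [])] + [None] * n
    for i in range(1, n + 1):
        candidates = []
        for j in range(max(0, i - max_len), i):
            chunk = " ".join(words[j:i])
            if i - j > 1 and chunk not in phrase_scores:
                continue  # multi-word chunks must be known phrases
            score = phrase_scores.get(chunk, 0.0) if i - j > 1 else 0.0
            prev_score, prev_segments = best[j]
            candidates.append((prev_score + score, prev_segments + [chunk]))
        best[i] = max(candidates, key=lambda c: c[0])
    return best[n][1]

print(segment("new york times", PHRASE_SCORES))         # ['new york times']
print(segment("new york times square", PHRASE_SCORES))  # ['new york', 'times square']
```

A greedy longest-match would grab “new york times” first and leave “square” stranded. Scoring whole segmentations is what lets the fourth word change how the first three are grouped.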

The Mike Siwek query illustrates how Google accomplishes this. When Singhal types in a command to expose a layer of code underneath each search result, it’s clear which signals determine the selection of the top links: a bi-gram connection to figure it’s a name; a synonym; a geographic location. “Deconstruct this query from an engineer’s point of view,” Singhal explains. “We say, ‘Aha! We can break this here!’ We figure that lawyer is not a last name and Siwek is not a middle name. And by the way, lawyer is not a town in Michigan. A lawyer is an attorney.”
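Here is a deliberately tiny Python sketch of that deconstruction, combining the three signals Singhal names: a name bi-gram, a synonym, and a location. The lookup tables and the function `deconstruct` are invented for this example. Google’s real signals are learned from data, not hand-written dictionaries like these.

```python
# Hypothetical lookup tables, far smaller than anything a real engine would use.
FIRST_NAMES = {"mike", "audrey"}
PROFESSION_SYNONYMS = {"lawyer": {"lawyer", "attorney"}}
STATE_ABBREVIATIONS = {"mi": "michigan"}

def deconstruct(query):
    """Label the parts of a short query as a name, a profession, or a location."""
    words = query.lower().split()
    parsed = {"name": None, "profession": set(), "location": None}
    i = 0
    while i < len(words):
        word = words[i]
        nxt = words[i + 1] if i + 1 < len(words) else None
        # A known first name followed by an unrecognized word is treated as a name bi-gram.
        if (parsed["name"] is None and word in FIRST_NAMES and nxt is not None
                and nxt not in PROFESSION_SYNONYMS and nxt not in STATE_ABBREVIATIONS):
            parsed["name"] = (word, nxt)
            i += 2
            continue
        if word in PROFESSION_SYNONYMS:
            parsed["profession"] |= PROFESSION_SYNONYMS[word]
        elif word in STATE_ABBREVIATIONS:
            parsed["location"] = STATE_ABBREVIATIONS[word]
        i += 1
    return parsed

result = deconstruct("mike siwek lawyer mi")
print(result["name"])                # ('mike', 'siwek')
print(sorted(result["profession"]))  # ['attorney', 'lawyer']
print(result["location"])            # michigan
```

Because “lawyer” is recognized as a profession, it is never mistaken for a surname, and “mi” is read as Michigan instead of as part of the name.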

This is the hard-won realization from inside the Google search engine, culled from the data generated by billions of searches: a rock is a rock. It’s also a stone, and it could be a boulder. Spell it “rokc” and it’s still a rock. But put “little” in front of it and it’s the capital of Arkansas. Which is not an ark. Unless Noah is around. “The holy grail of search is to understand what the user wants,” Singhal says. “Then you are not matching words; you are actually trying to match meaning.”

Thanks to the Interweb, Mike Siwek and Audrey Fino (and, from the article, Garden Grove psychologist Cindy Louise Greenslade) are now cyberlebrities, endlessly spidered, SERPed, and iterated through the dreamtime that is not real, yet we think about it. Like conceptual art.


Wordsmash #1: Remixing an article about literary remixing

“So much of the energy of great work to me is feeling the echo effect on every line, of not knowing where it came from.” But it has been a limited one, viewed with even greater suspicion now. “Why can’t literature catch up with the other arts?”

The news made waves in the United States with an almost novelistic kind of timing, just before the publication last week of a highly anticipated book by David Shields, “Reality Hunger,” a feisty literary “manifesto” built almost entirely of quotations from other writers and thinkers.

“Our would-be novelist says nothing is original, yet the passages she lifted from other books were original expressions in those books, even if the ideas were not new.” But Mr. Shields argues that blatant borrowing has been a foundation of culture since man first took up pen and paintbrush. “The test has always been in the pudding.”

The law and conventional ethics are still probably a long way from embracing the kind of world that Mr. Shields and Ms. Hegemann envision. But Louis Menand, the Harvard professor and New Yorker staff writer, suggested that, as with any creative movement, if the results are compelling and profound enough, even rigid conventions come around to making what seemed like a sin into a virtue.

“My goodness, it’s just straight out of my brainpan.”

Even the most original-seeming writing borrows from the centuries of writing that came before, so why not simply be more honest and maybe do something more interesting in the process? “We worked for years on the character development and the voice, and when we finally nailed the subtle epiphany, we cracked open a bottle of Champagne to celebrate.”

The borrowed words are marshaled to make a case against what Mr. Shields sees as boring fiction and in favor of genre-bending forms like the lyric essay. You could argue, of course, that Warhol’s use of a soup can or Danger Mouse’s use of the Beatles and Jay-Z on the Grey Album represent one thing, a re-contextualizing of cultural artifacts so well known they are a kind of shorthand. “There’s no such thing as originality anyway, just authenticity.”

Ms. Hegemann announced that appropriating the passages from that book and other sources was her plan all along. And Terence complained in the second century B.C. that “there’s nothing to say that hasn’t been said before.”

Mr. Shields, so firmly in the camp that sees appropriation as just another kind of collaboration, laments that expressive writing has lagged behind the other arts in using appropriation as a tool. “She basically did the book I wanted to do.”

A child of a media-saturated generation, she presented herself as a writer whose birthright is the remix, the use of anything at hand she feels suits her purposes, an idea of communal creativity that certainly wasn’t shared by those from whom she borrowed. His manifesto and Ms. Hegemann’s novel prompted the quick drawing of battle lines.

But does lifting from an obscure blogger — or even importing a description of a sunset by Steinbeck or a suburban tableau from Updike — accomplish the same thing? The most vital artists are those “breaking larger and larger chunks of ‘reality’ into their works.” Mr. Shields, a novelist who migrated to nonfiction, has called it “far and away the most personal book” he has ever written.

Mr. Shields’s book relies on thinkers from Wittgenstein to DJ Spooky, melding them into a voice that can sound at times eerily consistent. “If something is really successful, then the law tends to get changed and society changes to allow it to happen,” he said.

And though publishing-house lawyers required him to include an appendix listing his sources (at least those he could remember) Mr. Shields asks the reader to honor the spirit of the book by taking a pair of scissors and giving it an appendectomy. Maybe that’s one reason for the flurry of attention recently about a teenage German novelist, Helene Hegemann. Think of almost any kind of cultural endeavor and then use the word “we” to describe its creation.

Tensions have probably never been higher between a growing culture of borrowing and appropriation on one side and, on the other, copyright advocates and those who fear a steady erosion of creative protections. Flarf, the experimental poetry movement in which practitioners make verse out of the results of random Internet word searches — for those times “when we are not sure we are alive.”

Appropriation has breathed life into music, art and theater, he argues, and he lines up a kind of murderers’ row of writers, including Sterne, Emerson, Eliot and Joyce (“I am quite content to go down to posterity as a scissors and paste man”) to make the case that it has been an important tradition in writing, too.

A creative culture dominated by borrowing and repurposing is a “culture that will quickly grow stale.” In a world where the death of the novel has been announced with great regularity for almost half a century, such an open-source approach is the only way to keep literature alive.

Unmix this mashup at the New York Times, The Free-Appropriation Writer.
