I found two interesting articles, both written by this guy. Both articles concentrate on how Google Sets might actually work. The author tries to reverse-engineer the algorithm on the basis of a few observations of how it behaves.
The author seems to suggest a parsing of web content that takes into account how frequently a word occurs in the same context/profile. However, the author also seems to suggest that the algorithm parses pages looking only for tables or lists. This seems quite limiting, and I personally think there must be some more complicated machinery at work, possibly relying on clustering or a similar statistical method.
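To make the list-based idea concrete, here is a minimal sketch of how I imagine it could work. It is only my guess at the general technique, not the author's code and certainly not Google's actual algorithm. The names `ListExtractor`, `lists_from_pages`, `expand` and the sample `PAGES` are all made up for illustration, and the scoring rule (count how many seed words share a list with each candidate) is my own assumption.

```python
from collections import Counter
from html.parser import HTMLParser


class ListExtractor(HTMLParser):
    """Collect the text of <li> items, grouped by enclosing <ul>/<ol>."""

    def __init__(self):
        super().__init__()
        self.lists = []        # finished lists, each a list of item strings
        self._stack = []       # lists that are currently open
        self._in_item = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._stack.append([])
        elif tag == "li" and self._stack:
            self._in_item = True
            self._text = []

    def handle_data(self, data):
        if self._in_item:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._in_item and self._stack:
            item = "".join(self._text).strip().lower()
            if item:
                self._stack[-1].append(item)
            self._in_item = False
        elif tag in ("ul", "ol") and self._stack:
            self._in_item = False
            self.lists.append(self._stack.pop())


def lists_from_pages(pages):
    """Return every HTML list found in the given page sources."""
    found = []
    for html in pages:
        parser = ListExtractor()
        parser.feed(html)
        parser.close()
        found.extend(parser.lists)
    return found


def expand(seeds, lists, top_n=5):
    """Rank words that share a list with the seeds (assumed scoring rule)."""
    seeds = {s.lower() for s in seeds}
    scores = Counter()
    for items in lists:
        unique = set(items)
        overlap = len(seeds & unique)
        if overlap == 0:
            continue  # this list says nothing about the seeds
        for item in unique - seeds:
            scores[item] += overlap
    return [item for item, _ in scores.most_common(top_n)]


# Hypothetical page fragments standing in for crawled web content.
PAGES = [
    "<ul><li>Red</li><li>Green</li><li>Blue</li></ul>",
    "<ol><li>red</li><li>blue</li><li>yellow</li></ol>",
    "<ul><li>apple</li><li>pear</li></ul>",
]

if __name__ == "__main__":
    print(expand(["red", "blue"], lists_from_pages(PAGES)))
```

Even this toy version shows the limitation I mentioned: a word can only be suggested if it literally appears in the same list as a seed, which is why I suspect something like clustering over a much broader notion of context is involved.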
Funnily enough, the author suggests that similar techniques might be used to solve linguistic problems, and in A.I. situations where there is a need to get at the actual meaning of words.
Tags: google, text data mining