R. Cilibrasi and P. Vitanyi. Automatic meaning discovery using Google. Online preprint CL/0412098, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, The Netherlands, 2004. [url]
This paper discusses a simple but powerful idea: using Google to assign semantic meaning to words. The approach can be contrasted with the Cyc project, which tries to create an artificial common sense. The index built by Google, on the other hand, is unstructured and offers only a primitive query capability. “What Google lacks in expressiveness, it makes up for in size.”
The authors propose a method to automatically extract the meaning of words and phrases from the web using Google page counts. Intuitively, each page indexed by the engine may be viewed as a set of index terms, and a search for a particular term returns a certain number of hits. The authors put this in a probabilistic framework and define a measure called the ‘Normalized Google Distance (NGD)’, which is computed from the relative frequencies of web pages containing the search terms and which gives objective information about the semantic relations between those terms.
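To make the NGD concrete, here is a minimal Python sketch. The summary above does not state the formula; the one used here is the commonly cited definition from the published paper, NGD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)}), where f(x) is the number of pages containing x, f(x, y) the number containing both terms, and N the total number of indexed pages. The function name `ngd`, its parameter names, and the page counts in the usage lines are hypothetical and chosen only for illustration; no real search API is called.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google Distance from raw page counts.

    fx, fy -- pages containing term x and term y respectively
    fxy    -- pages containing both terms
    n      -- total number of indexed pages (an assumed normalizing constant)
    """
    # If either term, or the pair, never occurs, treat the terms as
    # infinitely far apart (the logarithm of zero is undefined).
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Hypothetical counts, not real search results.
print(ngd(fx=1000, fy=100, fxy=10, n=10**6))
print(ngd(fx=1000, fy=1000, fxy=1000, n=10**6))
```

Terms that always occur together get a distance of 0, and the distance grows as the number of co-occurrences shrinks relative to the individual counts. Because only ratios of logarithms appear, the choice of logarithm base does not affect the result.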
The authors provide a mathematical formalism for their framework and demonstrate positive correlations, evidencing semantic structure, both in numerical symbol notations and in a variety of other contexts.