M. Ceglowski, A. Coburn, and J. Cuadrado. Semantic search of unstructured data using contextual network graphs. Preliminary white paper, National Institute for Technology and Liberal Education, Middlebury College, Middlebury, Vermont, 05753 USA, 2003. [url]
This paper describe the Contextual Network Graph, a technique that should solve some of the pitfalls of the latent semantic indexing, like the poor scalability of the singular value decomposition algorithm.
The authors offer an alternative interpretation of the term-document matrix (TDM), which is essentially a lookup table of term frequency data for the entire document collection. In LSI this is interpreted as a high-dimensional vector space. Alternatively, this can be seen as a bipartite graph of term and document nodes where each non-zero value in the TDM corresponds to an edge connecting a term node to a document node. In this way, every term is connected to all of the documents in which the term appears, and every document has a link to each term contained in the document. The authors call this a contextual network graph.
This construct correspond to the intuition that documents that share many rare tems are likely to be semantically related.
Their idea is to use this representation to search a document collection energizing a query node and allowing the energy to propagate to other nodes along the edges of the graph based on a set of simple rules.
Their results show comparable results to an LSI search engine. In a 1981 dissertation at the University of Illinois, Scott Preece describes an almost identical technique under the name spreading activation search.
Tags: Exploratory Data Analysis, information retrieval, Latent Semantic Analysis, text data mining