S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990. [url]
This is the first technical LSI paper. It offers good background on the prior theory and on the technical choices behind the algorithm.
Essentially, the authors developed the technique to overcome a deficiency of the information retrieval methods of the time: the words searchers use are often not the same as those by which the information they seek has been indexed. The two sides of the issue they describe are synonymy and polysemy. The former is the fact that there are many ways to refer to the same concept; the latter is the fact that most words have more than one distinct meaning.
The criteria they used to distinguish between different models are: adjustable representational richness (hierarchical clustering is too restrictive; they looked for models whose power could be varied); explicit representation of both terms and documents (as in two-mode factor analysis or tree unfolding); computational tractability for large datasets.
The latent semantic structure analysis starts with a matrix of terms by documents. This matrix is then analyzed by Singular Value Decomposition (SVD) to derive the particular latent semantic structure model. The process decomposes the original matrix into three matrices, which contain "eigenvectors" and "eigenvalues" (more precisely, singular vectors and singular values) and show a breakdown of the original data into linearly independent components or factors.
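As a minimal sketch of this step, the decomposition and its rank-k truncation can be computed with NumPy; the toy term-document matrix below is invented for illustration and does not come from the paper:

```python
import numpy as np

# Hypothetical toy term-document matrix X (terms x documents);
# the counts are illustrative only, not taken from the paper.
X = np.array([
    [1, 0, 1, 0],   # term 1
    [1, 1, 0, 0],   # term 2
    [0, 1, 0, 1],   # term 3
    [0, 0, 1, 1],   # term 4
], dtype=float)

# SVD: X = T @ diag(s) @ Dt, where T and Dt hold the singular
# vectors (the "factors") and s the singular values.
T, s, Dt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values: the rank-k latent
# semantic model, the best rank-k least-squares approximation of X.
k = 2
X_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]
```

Dropping the smaller singular values is what discards the "noise" of individual word choice while keeping the dominant latent structure.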
Three particular comparisons were of interest to the authors: the comparison of two terms, the comparison of two documents, and the comparison of a term and a document.
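These three comparisons can be sketched in NumPy from the truncated factor matrices; the toy matrix is again invented for illustration, and the dot-product formulations below follow the quantities the paper derives from the reduced model:

```python
import numpy as np

# Same hypothetical toy term-document matrix as above (illustrative only).
X = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
], dtype=float)

T, s, Dt = np.linalg.svd(X, full_matrices=False)
k = 2
Tk, Sk, Dk = T[:, :k], np.diag(s[:k]), Dt[:k, :].T

# 1) term-term: dot products between rows of Tk @ Sk
#    (equivalently, the entries of X_k @ X_k.T).
term_sim = (Tk @ Sk) @ (Tk @ Sk).T

# 2) document-document: dot products between rows of Dk @ Sk
#    (equivalently, the entries of X_k.T @ X_k).
doc_sim = (Dk @ Sk) @ (Dk @ Sk).T

# 3) term-document: the corresponding cell of the rank-k
#    reconstruction X_k itself.
term_doc = Tk @ Sk @ Dk.T
```

Because the comparisons are done in the reduced k-dimensional space, two documents can be judged similar even when they share no terms, which is how the model addresses synonymy.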