M. Sahlgren. An introduction to random indexing. In Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, Copenhagen, Denmark, 16 August 2005. [pdf]
———————–
This paper presents the Random Indexing algorithm as an efficient alternative to LSI and similar word space methods. These methods rest on the distributional hypothesis, which states that words with similar meanings tend to occur in similar contexts.
The author identifies the main limitation of word space models as one of efficiency and scalability: the co-occurrence matrix quickly becomes computationally intractable as the vocabulary grows. Additionally, the author highlights that the vast majority of its cells are zero, since only a tiny fraction of words are distributionally promiscuous; most words occur in a very limited set of contexts.
Existing methods such as LSI address this with matrix truncation (e.g., SVD) to reduce dimensionality, which according to the author should be avoided for three reasons: 1) the reduction is computationally costly; 2) the reduction is a one-time operation that must be redone whenever new data is added; 3) the reduction requires an initial sampling of the data, which is often done with ad-hoc solutions.
Random Indexing reverses the traditional view (i.e., first construct the co-occurrence matrix, then extract context vectors): it accumulates context vectors incrementally from the start, so the explicit construction of a co-occurrence matrix becomes unnecessary (though one could still be assembled from the vectors if desired).
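To make the incremental accumulation concrete, here is a minimal sketch of the two-step idea behind Random Indexing: each word gets a fixed sparse ternary "index vector", and a word's context vector is built up by summing the index vectors of its neighbours as the corpus is scanned. This is an illustrative implementation, not Sahlgren's code; the dimensionality, window size, and sparsity values are arbitrary choices for the example.

```python
import numpy as np

def random_index_vector(dim, nonzero, rng):
    """Sparse ternary index vector: mostly zeros, a few random +1/-1 entries."""
    v = np.zeros(dim)
    positions = rng.choice(dim, size=nonzero, replace=False)
    v[positions] = rng.choice([-1.0, 1.0], size=nonzero)
    return v

def random_indexing(corpus, dim=512, nonzero=8, window=2, seed=0):
    """Accumulate context vectors directly, with no full co-occurrence matrix.

    corpus: iterable of tokenized sentences (lists of strings).
    Returns: dict mapping word -> accumulated context vector.
    """
    rng = np.random.default_rng(seed)
    index = {}    # word -> fixed sparse random index vector
    context = {}  # word -> incrementally accumulated context vector
    for sentence in corpus:
        for i, w in enumerate(sentence):
            if w not in index:
                index[w] = random_index_vector(dim, nonzero, rng)
                context[w] = np.zeros(dim)
        for i, w in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    # Add the neighbour's index vector to w's context vector.
                    context[w] += index[sentence[j]]
    return context

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

Because "cat" and "dog" below occur in identical contexts, their accumulated context vectors end up nearly parallel, while a word with a different distribution scores lower:

```python
ctx = random_indexing([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(cosine(ctx["cat"], ctx["dog"]))  # high similarity: identical contexts
print(cosine(ctx["cat"], ctx["the"]))  # lower: different distribution
```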
Tags: information retrieval, Latent Semantic Analysis, Singular Value Decomposition