R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In D. Lin and D. Wu, editors, Proceedings of EMNLP 2004, pages 404–411, Barcelona, Spain, July 2004. Association for Computational Linguistics. [pdf]
This paper presents TextRank, a graph-based ranking algorithm for information retrieval and automatic keyword extraction from texts. Building on PageRank, introduced by Brin and Page (1998), the authors show how edge weights can be added to the graph, with better retrieval results.
Vertices are ranked by "vote casting": when a vertex links to another one, it is in effect casting a vote for that other vertex. The more votes a vertex receives, the more important it is. The score of each vertex is determined by the votes cast for it and by the scores of the vertices casting those votes.
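The voting scheme can be sketched as a short iteration. This is a minimal illustration of the standard PageRank-style update on an unweighted directed graph, not the authors' code: the function name `rank_vertices`, the damping factor of 0.85, and the fixed iteration count are assumptions made for the example.

```python
def rank_vertices(out_links, damping=0.85, iterations=50):
    """Score the vertices of a directed graph by iterated vote casting.

    out_links maps each vertex to the list of vertices it links to.
    damping and iterations are illustrative defaults (assumptions).
    """
    vertices = set(out_links)
    for targets in out_links.values():
        vertices.update(targets)

    # in_links[v] lists the vertices that cast a vote for v.
    in_links = {v: [] for v in vertices}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)

    scores = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        updated = {}
        for v in vertices:
            # Each voter u splits its own score evenly over its outgoing links.
            votes = sum(scores[u] / len(out_links[u]) for u in in_links[v])
            updated[v] = (1 - damping) + damping * votes
        scores = updated
    return scores
```

In a graph such as `{"a": ["c"], "b": ["c"], "c": ["a"]}`, vertex `c` receives two votes and ends up ranked above `a`, which in turn ranks above `b`, which receives none.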
The paper also presents a good literature review on keyword extraction, covering supervised and unsupervised approaches, some of them based on Bayesian methods. The authors finally chose one method, namely Hulth (2003), as the baseline for comparison in their study.
They use TextRank for keyword extraction with a graph built on the co-occurrence relation, controlled by the distance between word occurrences: two vertices are connected if their corresponding lexical units co-occur within a window of at most N words, where N can be set anywhere from 2 to 10. Co-occurrence links thus express relations between syntactic elements and are similar to the semantic links found useful for word sense disambiguation.
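The window-based construction can be sketched as follows. This is a minimal sketch that assumes the text is already tokenized into a list of words and that the graph is undirected; the function name `cooccurrence_graph` and the adjacency-set representation are choices made for the example only.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Link two distinct words if they occur within `window` tokens of each other.

    Returns an undirected graph as a dict mapping each word to its neighbours.
    """
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        # A window of N words starting at position i covers positions i .. i+N-1.
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)
    return dict(neighbours)
```

With `window=2` only adjacent words are linked; raising the window to 3 also links words separated by one intervening token.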
Additionally, as the authors observe, the vertices added to the graph can be restricted with part-of-speech filtering, so that the algorithm considers only certain syntactic categories, such as nouns and verbs. The authors experimented with different syntactic filters and observed the best results when only nouns and adjectives were kept.
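A syntactic filter of this kind might look like the sketch below. It assumes the tokens have already been tagged by some part-of-speech tagger and that the tags follow the Penn Treebank convention (noun tags starting with `NN`, adjective tags with `JJ`); both the input format and the function name `filter_by_pos` are assumptions for illustration.

```python
def filter_by_pos(tagged_tokens, allowed_prefixes=("NN", "JJ")):
    """Keep only the words whose tag starts with one of the allowed prefixes.

    tagged_tokens is a list of (word, tag) pairs; the default prefixes
    select nouns and adjectives under Penn Treebank-style tags (assumed).
    """
    return [
        word.lower()
        for word, tag in tagged_tokens
        if tag.startswith(allowed_prefixes)
    ]
```

For example, `filter_by_pos([("Compatibility", "NN"), ("of", "IN"), ("linear", "JJ")])` keeps `compatibility` and `linear` and drops the preposition.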
The authors report the highest F-measure among the compared systems for their TextRank method. The results also show that the larger the window, the lower the precision, probably because the relation between words that are further apart is not strong enough to define a connection in the text graph. The results agree with the finding of Hulth (2003) that linguistic information helps keyword extraction.
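For reference, precision, recall, and F-measure for a set of extracted keywords can be computed as in the sketch below. It assumes exact string matching against a manually assigned reference set and the balanced F-measure (the harmonic mean of precision and recall); the function name `precision_recall_f` is hypothetical.

```python
def precision_recall_f(extracted, reference):
    """Return (precision, recall, F-measure) for extracted vs. reference keywords."""
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

A wider window tends to add more candidate keywords, which under this definition lowers precision unless the extra candidates also appear in the reference set.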