tf*idf

Term Frequency-Inverse Document Frequency. A kind of DocumentVector. This scheme assigns a weight to each term (vocabulary word) in a given document. The weight increases in proportion to the number of times the term occurs in the document, but is offset by a factor that devalues terms that are common across the overall corpus.

One formula, apparently a simplification of the one in Salton and Buckley (1988), is the following. The weight of a term t in a document D is:

(# of occurrences of term t in this document D) * log((total # of documents)/(# of documents with mention of term t))
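
As a rough illustration, the formula above can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions that the formula itself does not state: documents are already split into lists of tokens, the logarithm is the natural log (the base only rescales the weights), and the names tf_idf, corpus and vectors are made up for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists (tokenization is assumed to be done already).
    """
    n_docs = len(docs)

    # Number of documents that mention each term at least once.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    weights = []
    for doc in docs:
        counts = Counter(doc)  # occurrences of each term in this document
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus, invented for illustration only.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the cat chased the dog".split(),
]
vectors = tf_idf(corpus)
```

In this toy corpus "the" occurs in every document, so its weight is 2 * log(3/3) = 0 in the first document even though it occurs twice there. "mat" occurs in only one document, so it gets the largest weight for a single occurrence, log(3/1).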

References:

Gerard Salton and Christopher Buckley, Term-weighting approaches in automatic text retrieval, Information Processing and Management: an International Journal, v.24 n.5, p.513-523, 1988

Copyright notice: the present content was taken from the following URL, the copyrights are reserved by the respective author/s.

Mauro Cherubini

Professor at the University of Lausanne, Switzerland
