Performance measures in Information Retrieval

With the growing number of Information Retrieval tools, it is important to have common measures for assessing how well each one performs, so that systems can be compared. When searching and retrieving documents, four groups of results are of interest for quantitative analysis:

  • Relevant documents that were retrieved by the system (group A)
  • Irrelevant documents that were retrieved by the system (group B)
  • Relevant documents that the system missed (group C)
  • Irrelevant documents that were not retrieved by the system (group D)
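The four groups can be sketched with plain set operations. This is only an illustration: the function name `partition_results` and the toy document ids are made up for the example, and it assumes the full collection and the set of relevant documents are both known.

```python
def partition_results(retrieved, relevant, collection):
    """Split a collection into the four groups A, B, C, D (returned as sets)."""
    retrieved, relevant, collection = set(retrieved), set(relevant), set(collection)
    a = retrieved & relevant               # A: relevant and retrieved
    b = retrieved - relevant               # B: irrelevant but retrieved
    c = relevant - retrieved               # C: relevant but missed
    d = collection - retrieved - relevant  # D: irrelevant and not retrieved
    return a, b, c, d


# Hypothetical toy collection of ten document ids.
collection = range(1, 11)
relevant = {1, 2, 3, 4}
retrieved = {1, 2, 5, 6, 7}

a, b, c, d = partition_results(retrieved, relevant, collection)
print(len(a), len(b), len(c), len(d))  # 2 3 2 3
```

Every document in the collection falls into exactly one of the four groups, so the four sizes always add up to the size of the collection.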

Intuitively, a good system should maximize the retrieval of relevant documents and minimize the retrieval of irrelevant ones. Recall and Precision are the fundamental measures of the behavior of an information retrieval system and are defined as follows:

Recall: the number of relevant documents retrieved as a fraction of all relevant documents:

$$\text{Recall} = \frac{A}{A + C}$$


Precision: the number of relevant documents retrieved as a fraction of all documents retrieved by the system. Precision indicates the level of noise in the information presented to the user: the lower the precision, the more noise.

$$\text{Precision} = \frac{A}{A + B}$$
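The two definitions above can be written directly in terms of the group sizes A, B and C. The sketch below is a minimal illustration with made-up counts; returning 0.0 when a denominator is zero is a convention chosen here, not something the definitions prescribe.

```python
def recall(a, c):
    """Relevant documents retrieved as a fraction of all relevant documents: A / (A + C)."""
    return a / (a + c) if (a + c) else 0.0  # assumption: 0.0 when nothing is relevant


def precision(a, b):
    """Relevant documents retrieved as a fraction of all retrieved documents: A / (A + B)."""
    return a / (a + b) if (a + b) else 0.0  # assumption: 0.0 when nothing is retrieved


# Hypothetical group sizes: 2 relevant retrieved, 3 irrelevant retrieved, 2 relevant missed.
a, b, c = 2, 3, 2
print(recall(a, c))     # 0.5
print(precision(a, b))  # 0.4
```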


Research has shown that it is very difficult to achieve a high level of recall without sacrificing precision (Y. Yang et al). As the recall rate increases, precision tends to deteriorate rapidly. The trend is not linear and depends on the type and quality of the information and on the retrieval algorithm. Additional metrics focus on what the retrieval engine has missed rather than on what it retrieved.
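One way to see the trade-off is to walk down a ranked result list and compute recall and precision at every cut-off. The sketch below uses an invented ranking (1 = relevant, 0 = irrelevant) and a hypothetical helper name; real curves depend on the data and the system, and precision does not always fall monotonically.

```python
def precision_recall_points(ranked_relevance, num_relevant):
    """Return (recall, precision) after each rank of a ranked result list."""
    points = []
    hits = 0
    for k, rel in enumerate(ranked_relevance, start=1):
        hits += rel  # count relevant documents seen so far
        points.append((hits / num_relevant, hits / k))
    return points


# Hypothetical ranking with 4 relevant documents in total.
ranking = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
points = precision_recall_points(ranking, num_relevant=4)
for rec, prec in points:
    print(f"recall={rec:.2f}  precision={prec:.2f}")
```

In this toy ranking, retrieving more documents raises recall from 0.25 to 1.00 while precision drops from 1.00 to 0.40.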

Missed documents: the number of relevant documents missed by the search engine as a fraction of all relevant documents. It is the complement of the recall rate, that is, 1 - Recall.

$$\text{Miss} = \frac{C}{A + C}$$


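False Alarm: a measure of the level of noise in the output, commonly defined as the number of irrelevant documents retrieved as a fraction of all irrelevant documents:

$$\text{False Alarm} = \frac{B}{B + D}$$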
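Both error-oriented measures can be sketched from the group sizes. The false alarm formula used here, B / (B + D), is the common definition and is an assumption about what the author intended; the counts are invented for illustration.

```python
def miss_rate(a, c):
    """Relevant documents missed as a fraction of all relevant documents: C / (A + C)."""
    return c / (a + c) if (a + c) else 0.0


def false_alarm_rate(b, d):
    """Assumed definition: irrelevant documents retrieved over all irrelevant ones, B / (B + D)."""
    return b / (b + d) if (b + d) else 0.0


# Hypothetical group sizes.
a, b, c, d = 2, 3, 2, 13
print(miss_rate(a, c))         # 0.5, which equals 1 - recall
print(false_alarm_rate(b, d))  # 0.1875
```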


Because of the trade-off between recall and precision, some suggest combining the two into their harmonic mean. This is called the F-measure in the literature:

$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$


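The above formula is also known as the F1 measure, because precision and recall are evenly weighted. It is a special case of a general formula, defined for a non-negative real β:

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$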


Two other commonly used F measures are the F0.5 measure, which weights precision twice as much as recall, and the F2 measure, which weights recall twice as much as precision.
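The three F measures mentioned above can be computed with one small function. This is a sketch using the standard weighted harmonic mean, with the parameter name `beta` and the sample precision and recall values chosen for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives the F1 measure."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # assumption: define F as 0 when both precision and recall are 0
    return (1 + b2) * precision * recall / denom


# Hypothetical precision and recall values.
p, r = 0.4, 0.5
print(round(f_beta(p, r), 3))            # F1
print(round(f_beta(p, r, beta=0.5), 3))  # F0.5, leans towards precision
print(round(f_beta(p, r, beta=2.0), 3))  # F2, leans towards recall
```

Because recall is higher than precision in this example, F2 comes out above F1 and F0.5 below it.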

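Additionally, Mean Average Precision (MAP) is the mean, over a set of queries, of each query's average precision. The average precision of a query is the average of the precision values obtained after each relevant document is retrieved:

$$\text{AveP} = \frac{\sum_{r=1}^{N} P(r) \cdot \text{rel}(r)}{\text{number of relevant documents}}$$

where r is the rank, N is the number of retrieved documents, rel(r) is a binary function indicating whether the document at rank r is relevant, and P(r) is the precision at cut-off rank r. MAP is then the mean of AveP over all queries in the set.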
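Average precision and MAP can be sketched as follows. The function names and the two toy rankings are invented for the example; the code also assumes, unless told otherwise, that every relevant document appears somewhere in the ranked list, so the number of relevant documents is taken from the list itself.

```python
def average_precision(ranked_relevance, num_relevant=None):
    """Average of P(r) over the ranks r where rel(r) = 1.

    ranked_relevance: list of 0/1 flags, one per retrieved document, in rank order.
    num_relevant: total relevant documents for the query (assumed to be the
    number of 1s in the list when not given).
    """
    if num_relevant is None:
        num_relevant = sum(ranked_relevance)
    if num_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for r, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / r  # P(r): precision at cut-off rank r
    return total / num_relevant


def mean_average_precision(rankings):
    """Mean of the average precision over a set of queries."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two hypothetical queries with their ranked relevance flags.
q1 = [1, 0, 1, 0, 0]  # AveP = (1/1 + 2/3) / 2
q2 = [0, 1, 0, 0, 1]  # AveP = (1/2 + 2/5) / 2
print(round(average_precision(q1), 3))
print(round(average_precision(q2), 3))
print(round(mean_average_precision([q1, q2]), 3))
```

Passing `num_relevant` explicitly matters when some relevant documents were never retrieved: they then lower the average precision, as the denominator in the formula requires.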
