M. Sanderson and J. Zobel. Information retrieval system evaluation: Effort, sensitivity, and reliability. In Proceedings of the ACM Special Interest Group on Information Retrieval conference (SIGIR '05), pages 162–169, Salvador, Brazil, August 15–19, 2005. ACM. [pdf]
This paper discusses how researchers report and test the effectiveness of Information Retrieval systems. Effectiveness, in this context, is computed by measuring a system's ability to find relevant documents. Statistical significance tests such as the t-test and the Wilcoxon signed-rank test are commonly used. These tests rest on a series of assumptions: namely that the values being tested are distributed symmetrically and normally, and that they are a random sample of the population. Additionally, the tests are subject to type I and type II errors. A type I error is a false positive, occurring on average once every 20 tests at a p-value of 0.05. A type II error is a false negative, with an unknown incidence.
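The paired t-test discussed above compares two systems over the same set of topics. A minimal sketch of the statistic it computes, using hypothetical per-topic effectiveness scores rather than any data from the paper:

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic over per-topic scores for two IR systems.

    Relies on the assumption the paper examines: that the per-topic
    differences behave like a normally distributed random sample.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# Hypothetical per-topic scores for systems A and B on four topics.
t = paired_t_statistic([0.5, 0.6, 0.4, 0.7], [0.4, 0.5, 0.4, 0.5])
```

The resulting statistic is compared against the t distribution with n − 1 degrees of freedom; in practice one would use a library routine (e.g. `scipy.stats.ttest_rel`) that also returns the p-value.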
The results computed by the authors show that when significance tests are omitted, or when the measured improvements are small, the reported results are not reliable. Their conclusions suggest a testing methodology:
1- they found the t-test more reliable than the alternatives, based on its false-positive rate of 5%;
2- another criterion found is that the Mean Average Precision (MAP) difference between the systems should be at least 20%, together with a positive significance test, in order to demonstrate a real improvement for the tested IR engines;
3- for small topic set sizes (<25), observing statistical significance does not guarantee that a result will be repeatable on other sets of topics.
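The MAP criterion in point 2 can be sketched in code. Both functions are my illustration, not the authors' implementation, and the threshold is read here as a relative difference between the two systems' MAP scores:

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked result list against the set of
    documents judged relevant for that topic."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def meets_map_criterion(map_a, map_b, threshold=0.20):
    """True when system A improves on system B's MAP by at least
    `threshold` (the 20% figure the authors pair with a significance
    test), read here as a relative difference."""
    return (map_a - map_b) / map_b >= threshold

# Hypothetical ranking for one topic: relevant docs at ranks 1 and 3.
ap = average_precision(['d1', 'd2', 'd3'], {'d1', 'd3'})
```

MAP is then the mean of these per-topic average-precision values, and the criterion is checked on the two systems' MAP scores alongside the significance test.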