I had the great opportunity to discuss my thesis project with S. Dumais. One of the main questions I had was how to evaluate the goodness of a retrieval algorithm for geographical messaging. Dr. Dumais pointed out that any implicit measure of effectiveness is extremely dependent on presentation. How results are arranged, which font and color are used, and which position an item occupies in the ranked list all affect the way people use the system.
Additionally, users browse at different speeds, and their different cognitive styles lead them to formulate queries in different ways.
In my situation it is very difficult to propose a way to establish similarities and to measure effectiveness, because I do not have a clear model of the tasks users are trying to accomplish. The system supports open-ended conversations, which means there are no specific tasks to anchor an evaluation on. Along the same lines, it is very difficult for me to define which factors are important in the query / selection process. My current top list includes: semantic matching, geographic proximity, social rating, and contextual appropriateness (a mix of spatial and temporal factors in relation to personal objectives).
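To make that factor list concrete, here is a minimal sketch of how the four factors might be blended into a single ranking score. Everything in it is an assumption of mine for illustration: the Message fields, the term-overlap stand-in for semantic matching, the exponential distance decay, and above all the weights, which are arbitrary guesses; deciding what they should actually be is exactly the evaluation problem I am describing.

```python
import math
from dataclasses import dataclass

@dataclass
class Message:
    terms: set          # bag of words in the message text
    lat: float          # latitude of the message
    lon: float          # longitude of the message
    rating: float       # community rating, normalized to [0, 1]
    context_fit: float  # spatial/temporal fit with personal objectives, in [0, 1]

def term_overlap(a: set, b: set) -> float:
    """Crude stand-in for semantic matching: Jaccard overlap of query and message terms."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def proximity(d_km: float, scale_km: float = 1.0) -> float:
    """Geographic proximity score, decaying exponentially with distance."""
    return math.exp(-d_km / scale_km)

def score(msg: Message, q_terms: set, q_lat: float, q_lon: float,
          w=(0.4, 0.3, 0.2, 0.1)) -> float:
    """Linear blend of the four candidate factors; the weights are arbitrary."""
    d_km = 111.0 * math.hypot(msg.lat - q_lat, msg.lon - q_lon)  # rough degrees-to-km
    return (w[0] * term_overlap(msg.terms, q_terms)
            + w[1] * proximity(d_km)
            + w[2] * msg.rating
            + w[3] * msg.context_fit)

# Hypothetical usage:
msg = Message({"free", "pizza", "tonight"}, 45.46, 9.19, rating=0.8, context_fit=0.5)
print(score(msg, {"pizza", "downtown"}, 45.47, 9.18))
```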
At the other extreme of the spectrum, we have the standard IR evaluation techniques, in which effectiveness is measured a priori by judges. The challenge of this approach is that detaching the application from its natural context completely undermines its ecological validity. Also, to evaluate with this technique one needs to keep the queries constant between two different algorithms.
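For reference, the judge-based approach boils down to something like the following sketch, assuming judges have labeled a fixed set of queries once and precision at k is the metric of choice (any other standard metric would slot in the same way):

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k returned ids that the judges marked relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k

def mean_precision_at_k(run, judgments, k=10):
    """Average precision@k of one algorithm over a fixed query set.

    run maps each query to the ranking that algorithm produced;
    judgments maps the same queries to the set of judged-relevant ids.
    Because queries and judgments are held constant, two runs can be
    compared directly on the resulting numbers.
    """
    return sum(precision_at_k(run[q], judgments[q], k)
               for q in judgments) / len(judgments)
```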
A live system introduces noise into the evaluation process. Dr. Dumais pointed me to several studies that evaluate different presentation techniques (like Optimizing Search by Showing Results In Context).
Finally, we brainstormed a bit on possible ways to tackle the problem and find a new approach to surface the best matches. One idea that emerged was batch measurement of relevance: the user issues a query, the system returns the full list of results, and the user is asked to rate the relevance of each one. Another idea was to let the community decide on the relevance of a given result.
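A minimal sketch of how the batch idea could be scored, under the assumption that the user grades every returned result on a 0-2 scale: a rank-discounted sum of the grades (DCG) rewards an algorithm for placing the highly graded results near the top. The community variant would simply replace the single user's grades with an aggregate of many users' votes.

```python
import math

def discounted_gain(grades):
    """DCG over the grades a user assigned to the returned list, in rank order."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades))

# Hypothetical example: the user graded ten results on a 0-2 scale.
user_grades = [2, 0, 1, 2, 0, 0, 1, 0, 0, 0]
print(discounted_gain(user_grades))

# Community variant (sketch): the grade of a result is the mean of all votes cast on it.
def community_grade(votes):
    return sum(votes) / len(votes) if votes else 0.0
```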