Information Retrieval, Scientific Methodology, and Privacy

Well, now that I am attending SIGIR, I have to admit that sometimes I am a bit pissed off by some presentations from big industry players like Google, Yahoo, MSN, Ask, etc. (maybe I am just envious).

They present great results because they can access the usage logs of a huge number of users. Unfortunately, their datasets are almost always kept private, with privacy issues given as the reason. This means that we basically cannot verify most of the claims made in the papers. Nevertheless, these works are accepted by the scientific community.

I think that scientific contributions built on top of private datasets should not be accepted at mainstream conferences if their results cannot be replicated by other institutions using the same data. I do, however, understand that disclosing this information can sometimes cause trouble for the companies. So I propose two things:

1. Let’s create a special track in each mainstream conference for papers that present non-reproducible results.

2. Let’s do some research on how to scramble the data in a private dataset so that user privacy is preserved while the statistical validity of the dataset is maintained.
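
To make the second proposal a bit more concrete, here is a minimal Python sketch of one naive kind of scrambling: replacing user IDs with a keyed hash. Everything in it is my own assumption for illustration (the toy log, the `pseudonymize` and `queries_per_user` names, the choice of HMAC-SHA256), and it is not what any of the companies above actually do. It only shows that a simple statistic, the number of queries per user, survives the scrambling. It is not a real privacy guarantee, since the query text itself can still identify a person, which is exactly why the research is needed.

```python
import hashlib
import hmac
from collections import Counter


def pseudonymize(log, secret_key):
    """Replace each user ID with a keyed hash; the query text is left untouched."""
    scrambled = []
    for user_id, query in log:
        digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
        # Keep a short prefix of the digest as the pseudonym (illustrative choice).
        scrambled.append((digest[:16], query))
    return scrambled


def queries_per_user(log):
    """Sorted per-user query counts: a statistic that should survive scrambling."""
    return sorted(Counter(user for user, _ in log).values())


# Hypothetical toy log of (user_id, query) pairs.
log = [
    ("alice", "information retrieval"),
    ("bob", "cheap flights"),
    ("alice", "query log privacy"),
]

# The key is assumed to be known only to the data owner.
scrambled = pseudonymize(log, b"secret known only to the data owner")

# The original IDs are gone, but the per-user counts are unchanged.
print(queries_per_user(log) == queries_per_user(scrambled))
```

The same user always maps to the same pseudonym, so per-user statistics stay intact, while without the key nobody can recompute the pseudonym of a known user name.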

