The address book desk: an example of interactive furniture

Timo Arnall had a great idea: using Post-its on a desk and a mobile phone to trigger specific events and actions. The idea behind the address book desk is very simple: to each Post-it on the table top corresponds an RFID tag that can be programmed with a specific trigger. The mobile phone then comes into play: when it is brought close to a tag, its reader matches the user's intention to the trigger written on the Post-it and activates it. Very simple, yet very powerful.
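The mechanics are essentially a lookup from tag identity to action. Purely as an illustration (the tag IDs, contact names and read_tag() stub below are hypothetical, not part of Arnall's project), the dispatch logic could be sketched like this:

# Illustrative sketch only: map hypothetical RFID tag IDs to the actions
# written on the Post-its; nothing here comes from the actual project.
ACTIONS = {
    "04:A2:5F:1B": ("call", "Mum"),
    "04:A2:5F:2C": ("sms", "Timo"),
    "04:A2:5F:3D": ("email", "office"),
}

def read_tag():
    """Stand-in for the phone's RFID reader; returns a tag ID string."""
    return "04:A2:5F:1B"

def trigger(tag_id):
    """Look up the tag and fire the corresponding action."""
    action = ACTIONS.get(tag_id)
    if action is None:
        print("Unknown tag:", tag_id)
        return
    verb, target = action
    print(f"Triggering '{verb}' for '{target}'")

trigger(read_tag())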

[full story]

[Images: RFID address book desk]


Open Source Information Retrieval Systems


Computational Methods for Intelligent Information Access

M. W. Berry, S. T. Dumais, and T. A. Letsche. Computational methods for intelligent information access. In Supercomputing ’95: Proceedings of the 1995 ACM/IEEE conference on Supercomputing, page 20, San Diego, California, USA, 1995. ACM Press. [pdf]

—————–

This paper presents a detailed introduction to the Latent Semantic Indexing (LSI) method, describing its mathematical foundations and giving a visual example of the singular value decomposition (SVD) that lies at the core of the method.

The main assumption of LSI is that there is some underlying or latent structure in word usage that is partially obscured by variability in word choice. A truncated singular value decomposition is used to estimate this structure in word usage across documents. Retrieval is then performed using the database of singular values and vectors obtained from the truncated SVD.

The SVD reveals important information about the structure of a matrix, smoothing out the minor differences in terminology that give rise to synonymy and polysemy, two long-standing plagues of information retrieval.
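As a rough sketch of the mechanics (not the authors' implementation), LSI can be illustrated with a tiny term-document matrix, a rank-k truncated SVD and cosine similarity between a folded-in query and the documents in the reduced space. The toy corpus, the query and the choice k = 2 are arbitrary, and the fold-in convention q^T U_k S_k^-1 is one of several used in the literature:

import numpy as np

# Toy term-document matrix (rows = terms, columns = documents); raw counts,
# invented purely for illustration.
A = np.array([
    [2, 0, 1, 0],   # car
    [0, 2, 1, 0],   # automobile
    [1, 1, 2, 0],   # engine
    [0, 0, 0, 2],   # flower
    [0, 0, 0, 1],   # petal
], dtype=float)

# Truncated SVD: keep only the k largest singular values/vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

docs_k = Vtk.T                                 # each row: a document in the latent space
q = np.array([1, 0, 0, 0, 0], dtype=float)    # query containing only "car"
q_k = (q @ Uk) / sk                            # fold the query in: q^T U_k S_k^-1

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Rank documents by cosine similarity in the reduced space.
scores = [cosine(q_k, d) for d in docs_k]
print(sorted(range(len(scores)), key=scores.__getitem__, reverse=True))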

The paper also details the computational cost of keeping the matrices underlying the method up to date.


Same-Language Subtitling: a possible learning revolution?

I have always thought that simple ideas can change the world, and this seems to me one of those. It is so simple that I could not believe it had not been implemented before. In short, the idea is to add karaoke-style subtitles to popular movies and songs and broadcast them on mass media. In this way, illiterate people can start associating the pronunciation of words with their written form.

Same-Language Subtitling:

My organization, PlanetRead, works in Mumbai and Pondicherry, India. We have developed a “Same-Language Subtitling” (SLS) methodology, which provides automatic reading practice to individuals who are excluded from the traditional educational system, or whose literacy needs are otherwise not being met. This is an educational program rooted in mass media that demonstrates how a specific literacy intervention can yield outstanding, measurable results, while complementing other formal and non-formal learning initiatives of the government, private sector, and civil society. We are fortunate to have just been selected as a Google Foundation grantee.

More than 500 million people in India have access to TV and 40 percent of these viewers have low literacy skills and are poor. Through PlanetRead’s approach, over 200 million early-literates in India are getting weekly reading practice from Same Language Subtitling (SLS) using TV. The cost of SLS? Every U.S. dollar covers regular reading for 10,000 people – for a year.

I hit upon this idea in 1996 through a most ordinary personal experience. While taking a break from dissertation writing at Cornell University, I was watching a Spanish film with friends to improve my Spanish. The Spanish movie had English subtitles, and I remember commenting that I wished it came with Spanish subtitles, if only to help us grasp the Spanish dialogue better. I then thought, ‘And if they just put Hindi subtitles on Bollywood songs in Hindi, India would become literate.’ That idea became an obsession. It was so simple, intuitively obvious, and scalable in its potential to help hundreds of millions of people read — not just in India, but globally. So you can see how it works, we’ve uploaded some folk songs using SLS into Google Video.


TextRank: Bringing Order into Texts

R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In D. Lin and D. Wu, editors, Proceedings of EMNLP 2004, pages 404–411, Barcelona, Spain, July 2004. Association for Computational Linguistics. [pdf]

—————

This paper presents the TextRank algorithm for graph-based ranking and automatic keyword extraction from texts. Building on the PageRank algorithm introduced by Brin and Page (1998), the authors show how weights can be added to the graph edges and used in the ranking, with better results for the extraction task.

The ranking of the vertices is done by 'vote casting', following the principle that when a vertex links to another one, it is basically casting a vote for that other vertex. The higher the number of votes, the higher the importance of the vertex. The score of each vertex is determined both by the votes cast for it and by the scores of the vertices casting those votes.
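In the unweighted case this is the familiar PageRank score. Using the notation of the paper, with d a damping factor (typically 0.85) and In(V_i), Out(V_j) the sets of incoming and outgoing links:

S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|}

The paper extends this to weighted edges by replacing the uniform share 1/|Out(V_j)| with the normalized edge weight w_{ji} / \sum_{V_k \in Out(V_j)} w_{jk}.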

The paper also presents a good literature review on keyword extraction, some of it based on Bayesian methods, both supervised and unsupervised. The authors finally chose one method, namely Hulth (2003), as the baseline for comparison in their study.

They use the TextRank algorithm for keyword extraction based on a co-occurrence relation controlled by the distance between word occurrences: two vertices are connected if their corresponding lexical units co-occur within a window of at most N words, where N can be set anywhere from 2 to 10. Co-occurrence links therefore express relations between syntactic elements, similar to the semantic links found useful for the task of word sense disambiguation.

Additionally, as the authors observe, the vertices added to the graph can be pruned using part-of-speech filtering, restricting the algorithm to certain syntactic categories, such as nouns and verbs. The authors experimented with different syntactic filters, with the best results observed for nouns and adjectives only.
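A minimal sketch of this keyword-extraction procedure (undirected co-occurrence graph over a window of N words, iterative PageRank-style scoring) is given below; the naive tokenizer and the missing part-of-speech filter are shortcuts for brevity, not part of the original method:

import re
from collections import defaultdict

def textrank_keywords(text, window=3, d=0.85, iterations=30, top=5):
    # Naive tokenization; the paper additionally filters candidates by part of
    # speech (nouns and adjectives work best), omitted here for brevity.
    words = re.findall(r"[a-z]+", text.lower())

    # Undirected co-occurrence graph: two words are linked if they co-occur
    # within a window of `window` tokens.
    neighbours = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if w != words[j]:
                neighbours[w].add(words[j])
                neighbours[words[j]].add(w)

    # Iterative PageRank-style scoring on the unweighted graph.
    scores = {w: 1.0 for w in neighbours}
    for _ in range(iterations):
        new_scores = {}
        for w in neighbours:
            rank = sum(scores[v] / len(neighbours[v]) for v in neighbours[w])
            new_scores[w] = (1 - d) + d * rank
        scores = new_scores

    return sorted(scores, key=scores.get, reverse=True)[:top]

print(textrank_keywords(
    "graph based ranking algorithms decide the importance of a vertex "
    "within a graph based on global information drawn from the graph"))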

The authors registered the highest F-measure for their TextRank method. Additionally, the results showed that the larger the window, the lower the precision, probably explained by the fact that a relation between words that are further apart is not strong enough to define a connection in the text graph. The results agree with Hulth (2003) in showing that linguistic information helps the process of keyword extraction.


Yoono: a collaborative search engine

Yoono is a free software application you can download which combines, for the first time, the management and sharing of information. Based on pooling user knowledge, Yoono is a collaborative search engine and an innovative communication tool.

Yoono's search results are obtained from the bookmarks of web users. Searches are independent of the user's language, since Yoono does not index keywords on the web but the bookmarks of Yoono's users.

Search results are ordered by popularity. The popularity, or audience, of a web site is the number of times it has been added by the Yoono community for this search. Yoono also offers you a list of experts on the subject of the search. It identifies an expert using your search: if an expert has published the URL in a folder, he is identified in the results by nickname. The experts are ordered by level of expertise, which corresponds to the number of users subscribed to the published folder.

When you download Yoono, you also contribute to Yoono's search results.


Mozdex: an Open Source kind of Google

MozDex is a search engine seeded from the dmoz.org directory. MozDex uses open source search technologies to create an open and fair index. Their goal is to index the HTML content of the entire web and to provide a powerful, open search service to the community.

Some people may say that providing insight into the results will offer cheaters a better way to get higher ranks, but our view is that it allows us to openly discover and communicate new methods and algorithms that give a better view and representation and are less vulnerable to cheaters.

Proprietary systems already exist. We are here to use open technologies and open source to build an index that doesn't rely on proprietary software, processes or algorithms. Freedom of information, and of how that information gets to you, is what MozDex is about. MozDex is built around the Nutch search technology. Thanks to the developers and companies listed in the Nutch credits for making this software and project possible.

Mozdex


Performance measures in Information Retrieval

With the increasing number of tools for Information Retrieval, it is very important to have common measures of how well they perform, so that comparisons can be made. When searching and retrieving documents there are four groups of results that are of interest for quantitative analysis:

  • Relevant documents that were retrieved by the system (group A)
  • Irrelevant documents that were retrieved by the system (group B)
  • Relevant documents that the system missed (group C)
  • Irrelevant documents that were not retrieved by the system (group D)
                 Relevant    Irrelevant
Retrieved        A           B
Not retrieved    C           D

Intuitively a good system should try to maximize the retrieval of relevant documents, and minimize the retrieval of irrelevant documents. Recall and Precision are the fundamental parameters defining the behavior of an information retrieval system and are defined as follows:

Recall: defines the number of relevant documents retrieved as a fraction of all relevant documents

Recall = \frac{A}{A + C}

Precision: defines the number of relevant documents retrieved as a fraction of all documents retrieved by the system. Precision defines the level of noise in the information presented to the user.

Precision = \frac{A}{A + B}

Research has shown that it is very difficult to achieve a high level of recall without sacrificing precision (Yang et al.). As the recall rate increases, the precision deteriorates very rapidly, and the trade-off is not linear: it depends on the type and quality of the information and on the retrieval algorithm. Additional metrics focus on what the retrieval engine has missed instead of what it has retrieved.

Missed documents: defines the number of relevant documents missed by the search engine, as a fraction of all relevant documents. It is the complement of the recall rate.

Missed = \frac{C}{A + C} = 1 - Recall

False Alarm: the measure of the level of noise in the output, i.e. the fraction of retrieved documents that are irrelevant. It is the complement of precision.

False\ Alarm = \frac{B}{A + B} = 1 - Precision

Because of the inverse relation between recall and precision, some suggest summarizing the two with their harmonic mean. This is also called the F-measure in the literature.

F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}

The above formula is also known as the F1 measure, because precision and recall are evenly weighted. It is a special case of the more general formula:

F_\beta = \frac{(1 + \beta^2) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}

Two other commonly used F-measures are the F0.5 measure, which weights precision twice as much as recall, and the F2 measure, which weights recall twice as much as precision.

Additionally, the Mean Average Precision (MAP) is often reported: for a single query, the average precision is the mean of the precision values obtained after each relevant document is retrieved; MAP is this average precision averaged over a set of queries.

AP = \frac{\sum_{r=1}^{N} P(r) \cdot rel(r)}{\text{number of relevant documents}}

where r is the rank, N the number of retrieved documents, rel(r) is a binary function indicating whether the document at rank r is relevant, and P(r) is the precision at cut-off rank r.
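A small sketch tying these measures together for a single ranked result list; the ranked list and relevance judgments below are invented, and MAP would simply be the mean of average_precision over a set of queries:

def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 (groups A, B, C above)."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)                      # relevant and retrieved
    precision = a / len(retrieved) if retrieved else 0.0
    recall = a / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def average_precision(ranked, relevant):
    """Precision averaged at the rank of each relevant document."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for r, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / r                          # P(r) * rel(r)
    return total / len(relevant) if relevant else 0.0

# Invented example: 10 retrieved documents, 4 of which are actually relevant.
ranked = ["d3", "d7", "d1", "d9", "d4", "d2", "d8", "d5", "d6", "d0"]
relevant = {"d3", "d1", "d4", "d6"}

print(precision_recall_f1(ranked, relevant))   # precision, recall, F1
print(average_precision(ranked, relevant))     # AP for this single query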


Information Retrieval System Evaluation: Effort, Sensitivity, and Reliability

M. Sanderson and J. Zobel. Information retrieval system evaluation: Effort, sensitivity, and reliability. In Proceedings of the Special Interest Group on Information Retrieval SIGIR’05, pages 162–169, Salvador, Brazil, August 15-19 2005. ACM. [pdf]

——————-

This paper discusses the way researchers report results on the effectiveness of Information Retrieval systems. Effectiveness, in this context, is measured by the ability of a system to find relevant documents. Statistical significance tests like the t-test and the Wilcoxon test are commonly used. These tests rest on a series of assumptions, namely that the values being tested are distributed symmetrically and normally, and that they are a random sample of the population. Additionally, the tests can produce type I and type II errors: a type I error is a false positive, with an expected incidence of 1 in every 20 tests at a p-value of 0.05; a type II error is a false negative, with an unknown incidence.

The results computed by the authors show that if significance tests are omitted, or if the improvements are small, results are not reliable. Their conclusions suggest a testing methodology (a sketch of such a comparison follows the list):

1- they found the t-test more reliable than the alternatives, based on its false-positive rate of 5%;

2- another criterion is that the Mean Average Precision (MAP) difference between the tested systems should be at least 20%, together with a positive significance test, in order to demonstrate a reliable improvement for the tested IR engines;

3- for small topic set sizes (<25), observing statistical significance does not guarantee that a result will be repeatable on other sets of topics.
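As an illustration of the kind of comparison the paper analyzes, a paired t-test over the per-topic average-precision scores of two systems could be run as in the sketch below; the scores are invented, and scipy.stats.ttest_rel implements the paired t-test:

from scipy.stats import ttest_rel

# Invented per-topic average precision for two IR systems on the same 25 topics.
system_a = [0.31, 0.42, 0.18, 0.55, 0.27, 0.61, 0.39, 0.22, 0.48, 0.35,
            0.29, 0.51, 0.44, 0.19, 0.37, 0.58, 0.25, 0.41, 0.33, 0.47,
            0.21, 0.53, 0.36, 0.28, 0.45]
system_b = [0.36, 0.45, 0.22, 0.57, 0.31, 0.60, 0.44, 0.27, 0.52, 0.38,
            0.35, 0.54, 0.47, 0.24, 0.42, 0.61, 0.30, 0.46, 0.37, 0.50,
            0.26, 0.55, 0.41, 0.32, 0.49]

map_a = sum(system_a) / len(system_a)
map_b = sum(system_b) / len(system_b)
t_stat, p_value = ttest_rel(system_b, system_a)   # paired t-test across topics

print(f"MAP A = {map_a:.3f}, MAP B = {map_b:.3f}, "
      f"relative difference = {(map_b - map_a) / map_a:.1%}, p = {p_value:.4f}")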


TREC Datasets: Text REtrieval Conference Datasets for Information Retrieval

TREC is a series of conferences on Information Retrieval organized by NIST and co-sponsored by agencies of the US Department of Defense. Various datasets are made available to the public for testing and developing different search engines. One of the greatest advantages of these data is the availability of relevance judgments for a given query, produced by a number of experts who reviewed the results returned for that query.

TREC uses the following working definition of relevance: If you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant. Only binary judgments (“relevant” or “not relevant”) are made, and a document is judged relevant if any piece of it is relevant (regardless of how small the piece is in relation to the rest of the document).

Judging is done using a pooling technique (described in the Overview papers in the TREC proceedings) on the set of documents used for the task that year. The relevance judgments are considered “complete” for that particular set of documents. By “complete” we mean that enough results have been assembled and judged to assume that most relevant documents have been found. When using these judgments to evaluate your own retrieval runs, it is very important to make sure the document collection and qrels match.
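As a small illustration of the last point, a run can be scored against a qrels file (whitespace-separated lines of topic, iteration, document number and relevance) with a few lines of code; the file name, topic and document numbers below are placeholders, not real TREC data:

from collections import defaultdict

def load_qrels(path):
    """Read a qrels file with lines of: topic iteration docno relevance."""
    relevant = defaultdict(set)
    with open(path) as f:
        for line in f:
            topic, _iteration, docno, rel = line.split()
            if int(rel) > 0:
                relevant[topic].add(docno)
    return relevant

def precision_at_k(ranked_docs, relevant_docs, k=10):
    """Fraction of the top-k retrieved documents that are judged relevant."""
    top = ranked_docs[:k]
    return sum(1 for d in top if d in relevant_docs) / k

qrels = load_qrels("qrels.txt")                          # placeholder file name
run = {"401": ["FT911-3", "LA052390-0114", "FBIS3-1011"]}  # topic -> ranked docnos

for topic, ranked in run.items():
    print(topic, precision_at_k(ranked, qrels.get(topic, set()), k=3))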
