Snowball: A language for stemming algorithms

Algorithmic stemmers continue to have great utility in IR, despite the promise of out-performance by dictionary-based stemmers. Nevertheless, there are few algorithmic descriptions of stemmers, and even when they exist they are liable to misinterpretation. Snowball is a language, defined by Porter, in which stemmers can be exactly defined, and from which fast stemmer programs in ANSI C or Java can be generated. A range of stemmers is presented in parallel algorithmic and Snowball form, including the original Porter stemmer for English.

PyStemmer provides stemming functionality in Python for English, German, Norwegian, Italian, Dutch, Portuguese, French, and Swedish. PyStemmer is based on the Snowball stemmers.
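A quick, minimal sketch of how it is used (assuming the package is installed, e.g. with `pip install PyStemmer`):

```python
# Minimal PyStemmer session; the words are just illustrative examples.
import Stemmer

print(Stemmer.algorithms())        # list the available stemming algorithms

stemmer = Stemmer.Stemmer('english')
print(stemmer.stemWord('cycling'))             # stem a single word
print(stemmer.stemWords(['running', 'runs']))  # stem a list of words
```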

Tags: , ,

Cross-Language Evaluation Forum

The Cross-Language Evaluation Forum (CLEF) supports global digital library applications by (i) developing an infrastructure for the testing, tuning and evaluation of information retrieval systems operating on European languages in both monolingual and cross-language contexts, and (ii) creating test-suites of reusable data which can be employed by system developers for benchmarking purposes.

Through the organisation of system evaluation campaigns, the aim is to create a community of researchers and developers studying the same problems and to facilitate future collaborative initiatives between groups with similar interests. CLEF will also establish strong links, exchanging ideas and sharing results, with similar cross-language evaluation initiatives in the US and Asia that work on other sets of languages. The final goal is to assist and stimulate the development of European cross-language retrieval systems in order to guarantee their competitiveness in the global marketplace.

Their site at the University of Neuchâtel contains a great list of resources for natural language processing, such as stemmers and stop-word lists.

Tags: ,

Clean Up Parliament!

Beppe Grillo and thousands of other Italians (including me) have bought a page in the International Herald Tribune to inform people about, and protest against, the 23 members of the Italian Parliament who have been convicted of a variety of crimes (including corruption) and yet are allowed to keep sitting in Parliament.

He was recently interviewed by the BBC about the reasons for this protest. You can hear the interview here. Here is the list with the names of the 23 members of Parliament and an indication of their charges.


Tags: ,

Focused Crawling Using Context Graphs

M. Diligenti, F. M. Coetzee, S. Lawrence, C. L. Giles, and M. Gori. Focused crawling using context graphs. In Proceedings of the 26th VLDB Conference, pages 527–534, Cairo, Egypt, 2000. [pdf]

——————-

This article presents an approach to web crawling which is defined as ‘focused’, as opposed to standard crawling. Standard crawling uses a technique called forward crawling, that is, following the links found on a page to the next pages. Its limitation is that it fails to exploit the lower levels of a web site’s tree from the crawler’s point of entry.

The authors also list other limits of the standard crawling technique: a) the limited ability to sacrifice short-term document retrieval gains in the interest of better overall crawl performance; b) the lack of learning strategies for cases where topically relevant documents are found by following off-topic pages.

The authors’ contribution is the Context Focused Crawler, an engine that uses Google or AltaVista to perform ‘backward crawling’ from a given page. This is used to build a context graph of the page: a network of documents that moves backwards from the target document, following the hyperlinks that point to it.
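To make the backward-crawling idea concrete, here is a toy sketch of how the layers of a context graph could be assembled; `BACKLINKS` and `get_backlinks()` are hypothetical stand-ins for the search engine’s “which pages link to this URL?” query that the paper relies on:

```python
# Toy backlink index standing in for a search engine's backward-link query.
BACKLINKS = {
    "http://example.org/paper": ["http://example.org/publications"],
    "http://example.org/publications": ["http://example.org/home"],
}

def get_backlinks(url):
    return BACKLINKS.get(url, [])

def build_context_graph(target_url, depth=2):
    """Layer 0 is the target page; layer i holds pages i back-links away."""
    layers = [{target_url}]
    for _ in range(depth):
        layers.append({p for u in layers[-1] for p in get_backlinks(u)})
    return layers

for i, layer in enumerate(build_context_graph("http://example.org/paper")):
    print(i, sorted(layer))
```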

Afterwards, these documents are used to train a set of classifiers using a reduced version of the TF/IDF representation. Subsequently, by computing a Naive Bayes likelihood for the winning layer and comparing it to a threshold, it is possible to discard weakly matching pages.
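A minimal sketch of this layer-classification step, using scikit-learn in place of the authors’ own reduced TF/IDF representation and per-layer classifiers; the corpus, layer labels and threshold are purely illustrative:

```python
# Classify pages into context-graph layers with TF/IDF features and
# Naive Bayes, discarding pages whose best likelihood is below a threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["call for papers information retrieval evaluation",
        "department of computer science research groups",
        "national news weather and sports headlines"]
layers = [0, 1, 2]  # 0 = target pages, 1 = pages linking to them, ...

vectorizer = TfidfVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(docs), layers)

def winning_layer(text, threshold=0.5):
    """Return the best-matching layer, or None for weakly matching pages."""
    probs = clf.predict_proba(vectorizer.transform([text]))[0]
    best = probs.argmax()
    return int(clf.classes_[best]) if probs[best] >= threshold else None

print(winning_layer("workshop on information retrieval"))
```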

A side remark: the authors use the name Context Graph, but this is not to be confused with Ceglowski’s [2003] Contextual Network Graph. The reason is that the graph is not assembled in the same way, with a term/document procedure, and that the retrieval function is not based on the spreading-activation technique.

Tags: ,

Spam Filtering using Contextual Network Graphs

D. Kelleher. Spam filtering using contextual network graphs. Master’s thesis, Computer Science Department, Trinity College, Dublin, Ireland, 2004. [pdf]

——————-

This paper presents a comparison between three different classification methods, based on Naive Bayes, Latent Semantic Indexing (LSI) and Contextual Network Graphs (CNG), for the purpose of improving common spam-filtering mechanisms.

The author shows how LSI and CNG outperform the Naive Bayes classifier thanks to the ability those two methods have to deal with polysemy and synonymy in text. Polysemy, in fact, reduces the strength of keywords by increasing the number of classes in which a polysemous word can appear. Synonymy reduces the strength of keywords by spreading their value among a number of synonyms. The two methods above take a powerful approach to the classification of natural-language text because they use a context-based search that takes into account the semantic links between words and searches over a semantic space rather than simply a list of keywords.
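Since CNG retrieval is built on spreading activation over a term graph, a toy sketch may help; the graph, edge weights, decay factor and threshold below are illustrative, not the thesis’ actual parameters:

```python
# Spreading activation over a small term graph: energy flows from the
# starting term along weighted links, decaying until it falls below a
# threshold, so related terms light up even if they never co-occur exactly.
GRAPH = {
    "free": {"offer": 0.8, "gift": 0.6},
    "offer": {"free": 0.8, "discount": 0.7},
    "gift": {"free": 0.6},
    "discount": {"offer": 0.7},
}

def spread(start, energy=1.0, threshold=0.1, decay=0.5):
    """Propagate energy from `start`; stop where it drops below `threshold`."""
    activation = {start: energy}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for nbr, weight in GRAPH.get(node, {}).items():
            e = activation[node] * weight * decay
            if e > threshold and e > activation.get(nbr, 0.0):
                activation[nbr] = e
                frontier.append(nbr)
    return activation

print(spread("free"))  # activates 'offer', 'gift' and, weakly, 'discount'
```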

The fundamental problem of the Naive Bayes probabilistic approach is that it makes the assumption that each word in a document is independent of the others. This independence assumption makes the classifier unsuitable for natural language processing, since it loses the contextual relations between words and fails to address the issues above.
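To see the assumption at work, here is a minimal Naive Bayes scorer; the priors and per-word probabilities are made-up numbers, and note how each word contributes to the score independently of its neighbours:

```python
import math

# log P(class | doc) is proportional to log P(class) + sum of log P(word | class):
# every word contributes on its own, which is exactly the independence
# assumption criticised above.
prior = {"spam": 0.4, "ham": 0.6}
word_given_class = {
    "spam": {"free": 0.05, "offer": 0.04, "meeting": 0.001},
    "ham": {"free": 0.005, "offer": 0.004, "meeting": 0.03},
}

def naive_bayes_log_score(words, cls):
    score = math.log(prior[cls])
    for w in words:
        # unseen words get a tiny floor probability instead of zero
        score += math.log(word_given_class[cls].get(w, 1e-6))
    return score

doc = ["free", "offer"]
print(max(prior, key=lambda c: naive_bayes_log_score(doc, c)))  # -> spam
```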

The author uses his own implementation of the above methods, with a publicly available dataset and stop-word removal. The implementation was done in JAMA, a Java matrix package.

The success of each method was measured using a standard machine-learning protocol: a k-fold cross-validation testing mechanism (a sketch follows below). The author reports that the most important result from the tests was not the accuracy, but the fact that the high complexity of SVD makes LSI significantly slower than the other two methods.
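For reference, this is roughly what such a testing mechanism looks like, sketched with scikit-learn rather than the thesis’ Java/JAMA implementation, and with a tiny made-up corpus:

```python
# k-fold cross-validation: the corpus is split into k folds, and each fold
# in turn serves as the test set while the rest is used for training.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["free money offer", "win a prize now", "cheap meds online",
        "unbeatable offer today", "click to claim prize",
        "meeting at noon", "draft of the paper", "lunch on friday",
        "notes from the seminar", "agenda for the review"]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 1 = spam, 0 = ham

model = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(model, docs, labels, cv=5)  # 5 train/test splits
print(scores.mean())
```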

Additionally, the results showed that the ratio between the initial activation energy and the energy threshold of CNG had to be adjusted to maintain accuracy while increasing the corpus size. This may be related to the fact that, as the dataset grows, the threshold value should be adapted, as also suggested by Kimoto and Iwadera [1990] (p. 236): ‘this suggests that the threshold value should increase with traversal distance.’

In conclusion, the results show that CNG should be preferred over LSI because of its computational speed and because it is not patented.

Tags: , , , , ,

Software that goes … on a stick

While waiting for the Ajax/Web 2.0 revolution, I found this intermediate solution that lets you carry around a great collection of software without the weight of a laptop or a handheld. Sometimes, in fact, I find myself needing a decent piece of software on a machine I am working on, without the hassle of having to install the thing.

Recently I started noticing a growing number of projects that port common software apps to a USB-stick format so that they can be run across platforms. Here is a collection I assembled recently:

Tags:

Construction of a Dynamic Thesaurus and Its Use for Associated Information Retrieval

H. Kimoto and T. Iwadera. Construction of a dynamic thesaurus and its use for associated information retrieval. In Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval, pages 227–240, Brussels, Belgium, 1990. ACM Press, New York, NY, USA. [pdf]

——————–

This paper describes the AIRS system (Associated Information Retrieval System), where for the first time (?) a dynamic thesaurus was used in an associative retrieval system. Basically, the authors built a term-by-term matrix of keywords, assembled by putting together information from a user’s sample of relevant documents and a static thesaurus.

The dynamic thesaurus was constructed as a network structure where each node represented a keyword and each link represented a relationship between keywords. A weight was assigned to each node and to each link between the nodes. Four different thesauri were then derived by combining the presence or absence of node and link weighting, and these served the experimental evaluation.
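A toy sketch of the idea, with node weights taken from keyword frequency in the relevant documents and link weights from co-occurrence (the real system also merges in information from a static thesaurus):

```python
# Build a weighted keyword network from a user's relevant documents:
# nodes get heavier the more often a keyword appears, links get heavier
# the more often two keywords co-occur in the same document.
from collections import defaultdict
from itertools import combinations

relevant_docs = [
    ["retrieval", "thesaurus", "keyword"],
    ["retrieval", "network", "activation"],
    ["thesaurus", "keyword", "network"],
]

node_weight = defaultdict(int)  # keyword -> weight
link_weight = defaultdict(int)  # (keyword_a, keyword_b) -> weight

for doc in relevant_docs:
    for kw in doc:
        node_weight[kw] += 1
    for a, b in combinations(sorted(set(doc)), 2):
        link_weight[(a, b)] += 1

print(node_weight["retrieval"])               # 2
print(link_weight[("keyword", "thesaurus")])  # 2
```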

Results suggested that node weighting was effective for improving keyword precision. Additionally, the findings suggested that the activation threshold value of a node should increase with the traversal distance of the network routing. Finally, the use of the thesaurus with weighted nodes and links showed an increase in both precision and recall.

Tags: ,