SMS Corpus

Patrick pointed me to this great collection of SMS (Short Message Service) messages gathered for research at the Department of Computer Science at the National University of Singapore. As of April 2004, the corpus consists of about 10,000 SMS messages collected by students. The messages largely originate from Singaporeans, mostly students attending the University. They were contributed by volunteers who were made aware that their contributions were going to be made publicly available.

Using this collection, I might be able to verify some initial intuitions about the relation between linguistic identifiers pertaining to spatial information and content disambiguation. It would be cool to parse these messages looking for keywords like “there”, “here”, “over”, “crossing”, etc., and compare the relative frequencies of these words with the frequencies of the same words within a collection of geographical messages like that of UrbanTapestries.
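To make the idea concrete, here is a rough sketch of that comparison in Python. Everything in it is an assumption for illustration: the function name `relative_frequencies`, the simple regex tokenizer, and the two tiny made-up message lists that stand in for the SMS corpus and the geographical collection. The keyword list is just the four examples above, not a proper taxonomy.

```python
import re
from collections import Counter

# Illustrative keyword list taken from the examples in the post; not a vetted taxonomy.
SPATIAL_KEYWORDS = ("there", "here", "over", "crossing")


def relative_frequencies(messages, keywords=SPATIAL_KEYWORDS):
    """Return each keyword's share of all word tokens found in `messages`."""
    counts = Counter()
    total = 0
    for message in messages:
        # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", message.lower())
        total += len(tokens)
        counts.update(t for t in tokens if t in keywords)
    if total == 0:
        return {k: 0.0 for k in keywords}
    return {k: counts[k] / total for k in keywords}


# Toy, made-up messages standing in for the two collections.
sms_sample = ["See you there at 5", "I am here already", "Call me later"]
geo_sample = ["Great coffee over here", "Careful at this crossing", "Meet over there"]

sms_freq = relative_frequencies(sms_sample)
geo_freq = relative_frequencies(geo_sample)

for word in SPATIAL_KEYWORDS:
    print(f"{word:10s} sms={sms_freq[word]:.3f} geo={geo_freq[word]:.3f}")
```

Real SMS text is full of abbreviations and creative spellings, so a tokenizer this simple would probably miss many spatial markers; it is only meant to show the shape of the comparison.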

This would tell whether the messaging strategies differ between the two settings. However, before doing this, I’ll be looking for a taxonomy of these semantic markers.

P.S.: A corpus of 30,000 SMS messages in French was recently made available at a cost of about 300 euros.

Tags: data mining, ethnography, linguistics, Short Message Service, statistics, tagging, text data mining

Mauro Cherubini

Professor at the University of Lausanne, Switzerland
