Untangling Text Data Mining

M. A. Hearst. Untangling text data mining. In Proceedings of the ACL’99: the 37th Annual Meeting of the Association for Computational Linguistics. University of Maryland, June 20-26 1999. [url]

Mining implies extracting precious nuggets of ore from otherwise worthless rock. This paper suggests that text data mining is often confused with information retrieval or information access.

This search-centric view misses the point that we can actually extract new, never-before-encountered information from the data.

The possibilities for data mining from large text collections are virtually untapped. Text expresses a vast, rich range of information, but encodes this information in a form that is difficult to decipher automatically. Perhaps for this reason, there has been little work in text data mining to date, and most people who have talked about it have either conflated it with information access or have not made use of text directly to discover heretofore unknown information. In this paper I will first define data mining, information access, and corpus-based computational linguistics, and then discuss the relationship of these to text data mining. The intent behind these contrasts is to draw attention to exciting new kinds of problems for computational linguists. I describe examples of what I consider to be real text data mining efforts and briefly outline our recent ideas about how to pursue exploratory data analysis over text.
