Marti A. Hearst
July, 1997
The possibilities for data mining from large text collections are virtually untapped. Text expresses a vast, rich range of information, but encodes this information in a form that is difficult to decipher automatically. For example, it is much more difficult to graphically display textual content than quantitative data.
In this presentation, I will discuss important properties of text and introduce some of the powerful text analysis techniques that have been developed over the last few years by the empirical computational linguistics community. I will illustrate these techniques with text application subproblems (such as information extraction) that should be especially relevant to text data mining. Finally, I will spend a few minutes on the difference between text data mining and information retrieval, and describe the advances in information access that should be relevant to text data mining in the future.