Text Data Mining

Issues, Techniques, and the Relationship to Information Access

Marti A. Hearst
July, 1997

The possibilities for data mining from large text collections are virtually untapped. Text expresses a vast, rich range of information, but encodes this information in a form that is difficult to decipher automatically. For example, it is much more difficult to graphically display textual content than quantitative data.

In this presentation, I will discuss important properties of text and introduce some of the powerful text analysis techniques that have been developed over the last few years by the empirical computational linguistics community. I will illustrate these techniques with text application subproblems (such as information extraction) that should be especially relevant to text data mining. Finally, I will spend a few minutes on the difference between text data mining and information retrieval, and describe the advances in information access that should be relevant to text data mining in the future.





Table of Contents

Text Data Mining

Outline

Background

Text DM != IR

Information Access (Information Retrieval more broadly construed)

Web != Text

Ore-Filled Text Collections

Why Text is Tough

Why Text is Tough

Why Text is Easy

Recent Trends in NLP (CL)

Stupid Text Tricks

There’s One in Every Crowd (Zipf’s Law)

Text Analysis (CL) Tasks

Transformation Rules

Transformation Rules Example

Observation

Text Data Mining Tasks

Text “Data Cleaning”

Question Answering Murax, Kupiec, SIGIR ’93

Information Extraction

Information Extraction Example

Information Extraction: Learning Lexico-Syntactic Patterns Autoslog-TS, Riloff, AAAI ’96

Summarizing

Summarizing (Heuristic) Features Kupiec et al.

Summary of Summary Paper Kupiec, Pedersen, and Chen, SIGIR ’94

Text Categorization

What Categories Do

How to Use Text Categories

Scatter/Gather Clustering Cutting, Pedersen, Karger, Tukey 92, 93

New use: Organize Retrieval Results Hearst et al 95, Hearst & Pedersen 96

S/G Example: query on “star”

An Example TDM System Dagan, Feldman, and Hirsh, SDAIR ’96 (But not really mining over TEXT)

Visualization of Text Characteristics

Text Visualization

Text Visualization

Text Visualization (Summary)

Needed from Systems

Needed from Viz/CL

How to Deal with the Web

Summary and Future

Author: hearst

Email: hearst@sims.berkeley.edu

Home Page: www.sims.berkeley.edu/~hearst

Other information:
Presentation for UW/MS Workshop on Data Mining, July 1997