Marti A. Hearst

Professor

University of California, Berkeley

Research: TextTiling

    Preamble: The TextTiling work was done in the early 1990s, when this text was written.

    TextTiling is a technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics.

    Articles that describe and inform -- such as science magazine articles and environmental impact reports -- can be viewed as being composed of a few main topics and a series of short, sometimes densely discussed, subtopics. For example, consider a 23-paragraph article whose main topic is the exploration of Venus by the Magellan space probe. A reader divided this text into the following segments, with the labels shown, where the numbers indicate paragraph numbers:

    • 1-2 Intro to Magellan space probe
    • 3-4 Intro to Venus
    • 5-7 Lack of craters
    • 8-11 Evidence of volcanic action
    • 12-15 River Styx
    • 16-18 Crustal spreading
    • 19-21 Recent volcanism
    • 22-23 Future of Magellan

    TextTiling is a method for partitioning full-length text documents into coherent multi-paragraph units, like those seen above, that correspond to a sequence of subtopical passages. The algorithm assumes that a set of words is in use during the course of a given subtopic discussion, and when that subtopic changes, a significant proportion of the vocabulary changes as well.

    The approach uses quantitative lexical analyses to determine the segmentation of the documents. The tiles have been found to correspond well to human judgements of the major subtopic boundaries of science magazine articles.
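
    The vocabulary-shift idea described above can be illustrated with a minimal sketch. This is not the published TextTiling implementation: it assumes paragraphs as the comparison unit, cosine similarity between adjacent blocks of paragraphs, a simple "depth of valley" score, and a fixed threshold. The function names, the tiny stopword list, the block size, and the threshold value are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the list used in the papers).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def gap_similarities(paragraphs, block=2):
    """Similarity across each paragraph gap, comparing up to `block`
    paragraphs on the left with up to `block` paragraphs on the right."""
    counts = [Counter(tokenize(p)) for p in paragraphs]
    sims = []
    for gap in range(1, len(counts)):
        left = sum(counts[max(0, gap - block):gap], Counter())
        right = sum(counts[gap:gap + block], Counter())
        sims.append(cosine(left, right))
    return sims


def depth_scores(sims):
    """For each gap, how far similarity drops relative to the nearest
    peaks reached by climbing left and right."""
    depths = []
    for i, s in enumerate(sims):
        left = i
        while left > 0 and sims[left - 1] >= sims[left]:
            left -= 1
        right = i
        while right < len(sims) - 1 and sims[right + 1] >= sims[right]:
            right += 1
        depths.append((sims[left] - s) + (sims[right] - s))
    return depths


def segment(paragraphs, block=2, threshold=0.5):
    """Return 0-based indices of paragraphs that start a new segment.
    The fixed threshold is an illustrative assumption."""
    depths = depth_scores(gap_similarities(paragraphs, block))
    return [i + 1 for i, d in enumerate(depths) if d > threshold]


# Toy input: three paragraphs on one topic, then three on another.
toy = [
    "Magellan probe radar orbit mapping",
    "Magellan radar orbit probe images",
    "Magellan probe mapping radar orbit",
    "volcano lava flows crust eruption",
    "lava volcano eruption crust flows",
    "crust lava flows volcano eruption",
]
print(segment(toy))  # prints [3]
```

    On the toy input, the two halves share no vocabulary, so the similarity across the middle gap falls to zero and the only boundary reported is before the fourth paragraph (index 3). Real articles have far noisier similarity curves, so the block size and threshold would need tuning.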

    Papers

    • The quintessential TextTiling paper:

      Marti A. Hearst, Multi-Paragraph Segmentation of Expository Text. Proceedings of the 32nd Meeting of the Association for Computational Linguistics, Las Cruces, NM, June, 1994. pdf

    • A long, and more detailed, journal article:

      Hearst, M. TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages, Computational Linguistics, 23 (1), pp. 33-64, March 1997. pdf

    • How to evaluate segmentation algorithms like TextTiling:

      Pevzner, L., and Hearst, M., A Critique and Improvement of an Evaluation Metric for Text Segmentation, Computational Linguistics, 28 (1), March 2002, pp. 19-36. pdf

    • The use of TextTiling, and passage retrieval in general, to improve information retrieval:

      Marti A. Hearst and Christian Plaunt, Subtopic Structuring for Full-Length Document Access, Proceedings of the 16th Annual International ACM/SIGIR Conference, Pittsburgh, PA, 1993. pdf

    • An earlier paper, presenting more preliminary work:

      Marti A. Hearst, TextTiling: A Quantitative Approach to Discourse Segmentation, Technical Report UCB:S2K-93-24, 1993. postscript

    Software