Preamble: The TextTiling work was done in the early 1990s, when this text was written.
TextTiling is a technique for automatically subdividing texts into multi-paragraph units that represent passages, or subtopics.
Articles that describe and inform -- such as science magazine articles and environmental impact reports -- can be viewed as being composed of a few main topics and a series of short, sometimes densely discussed, subtopics. For example, consider a 23-paragraph article whose main topic is the exploration of Venus by the Magellan space probe. A reader divided this text into the following segments, with the labels shown, where the numbers indicate paragraph numbers:
TextTiling is a method for partitioning full-length text documents into coherent multi-paragraph units, like those seen above, that correspond to a sequence of subtopical passages. The algorithm assumes that a set of words is in use during the course of a given subtopic discussion, and when that subtopic changes, a significant proportion of the vocabulary changes as well.
The approach uses quantitative lexical analyses to determine the segmentation of the documents. The tiles have been found to correspond well to human judgements of the major subtopic boundaries of science magazine articles.
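The vocabulary-shift assumption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published implementation: it compares whole paragraphs rather than fixed-size blocks of tokens, uses a tiny made-up stopword list, and places a boundary wherever a gap's "depth" exceeds the mean depth minus half a standard deviation, which is an assumed cutoff. All function names here (tokenize, gap_scores, depth_scores, boundaries) are hypothetical.

```python
import math
import re
import statistics
from collections import Counter

# Illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"the", "of", "with", "for", "from", "by", "and", "were",
             "a", "an", "in", "to", "is", "was"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens minus stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def gap_scores(paragraphs):
    """Lexical similarity across each gap between adjacent paragraphs."""
    bags = [Counter(tokenize(p)) for p in paragraphs]
    return [cosine(bags[i], bags[i + 1]) for i in range(len(bags) - 1)]


def depth_scores(scores):
    """Depth of each gap: how far its score lies below the peaks on both sides."""
    depths = []
    for i, s in enumerate(scores):
        left = s
        for v in reversed(scores[:i]):   # climb leftwards while scores keep rising
            if v < left:
                break
            left = v
        right = s
        for v in scores[i + 1:]:         # climb rightwards while scores keep rising
            if v < right:
                break
            right = v
        depths.append((left - s) + (right - s))
    return depths


def boundaries(paragraphs):
    """Return the indices of paragraphs that start a new segment."""
    depths = depth_scores(gap_scores(paragraphs))
    if not depths:
        return []
    # Assumed cutoff: mean minus half a standard deviation, never below zero.
    cutoff = max(statistics.mean(depths) - statistics.pstdev(depths) / 2, 0.0)
    return [i + 1 for i, d in enumerate(depths) if d > cutoff]


paragraphs = [
    "The probe mapped the surface of Venus with radar.",
    "Radar images of the Venus surface show craters.",
    "Funding for the mission came from the budget.",
    "The budget and funding were debated by the mission planners.",
]
print(boundaries(paragraphs))  # [2]
```

In this toy input the first two paragraphs share "radar", "Venus", and "surface", the last two share "funding", "mission", and "budget", and the middle gap shares nothing, so the only boundary falls before the third paragraph (index 2).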
Marti A. Hearst, Multi-Paragraph Segmentation of Expository Text. Proceedings of the 32nd Meeting of the Association for Computational Linguistics, Las Cruces, NM, June 1994.
Hearst, M., TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages. Computational Linguistics, 23 (1), pp. 33-64, March 1997.
Pevzner, L., and Hearst, M., A Critique and Improvement of an Evaluation Metric for Text Segmentation. Computational Linguistics, 28 (1), pp. 19-36, March 2002.
Marti A. Hearst and Christian Plaunt, Subtopic Structuring for Full-Length Document Access. Proceedings of the 16th Annual International ACM/SIGIR Conference, Pittsburgh, PA, 1993.
Marti A. Hearst, TextTiling: A Quantitative Approach to Discourse Segmentation. Technical Report UCB:S2K-93-24, 1993.