Marti A. Hearst
Computer Science Division, 571 Evans Hall
University of California, Berkeley
Berkeley, CA 94720
Xerox Palo Alto Research Center
This paper describes TextTiling, an algorithm for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts. The algorithm uses domain-independent lexical frequency and distribution information to recognize the interactions of multiple simultaneous themes. Two fully-implemented versions of the algorithm are described and shown to produce segmentation that corresponds well to human judgments of the major subtopic boundaries of thirteen lengthy texts.
The structure of expository texts can be characterized as a sequence of subtopical discussions that occur in the context of a few main topic discussions. For example, a popular science text called Stargazers, whose main topic is the existence of life on earth and other planets, can be described as consisting of the following subdiscussions (numbers indicate paragraph numbers):
Subtopic structure is sometimes marked in technical texts by headings and subheadings which divide the text into coherent segments; brown83 state that this kind of division is one of the most basic in discourse. However, many expository texts consist of long sequences of paragraphs with very little structural demarcation. This paper presents fully-implemented algorithms that use lexical cohesion relations to partition expository texts into multi-paragraph segments that reflect their subtopic structure. Because the model of discourse structure is one in which text is partitioned into contiguous, nonoverlapping blocks, I call the general approach TextTiling. The ultimate goal is to not only identify the extents of the subtopical units, but to label their contents as well. This paper focusses only on the discovery of subtopic structure, leaving determination of subtopic content to future work.
Most discourse segmentation work is done at a finer granularity than that suggested here. However, for lengthy written expository texts, multi-paragraph segmentation has many potential uses, including the improvement of computational tasks that make use of distributional information. For example, disambiguation algorithms that train on arbitrary-size text windows, e.g., yarowsky92 and gale92c, and algorithms that use lexical co-occurrence to determine semantic relatedness, e.g., schuetze93, might benefit from using windows with motivated boundaries instead.
Information retrieval algorithms can use subtopic structuring to return meaningful portions of a text if paragraphs are too short and sections are too long (or are not present). Motivated segments can also be used as a more meaningful unit for indexing long texts. salton93, working with encyclopedia text, find that comparing a query against sections and then paragraphs is more successful than comparing against full documents alone. I have used the results of TextTiling in a new paradigm for information access on full-text documents [Hearst1994].
The next section describes the discourse model that motivates the approach. This is followed by a description of two algorithms for subtopic structuring that make use only of lexical cohesion relations, the evaluation of these algorithms, and a summary and discussion of future work.
Many discourse models assume a hierarchical segmentation model, e.g., attentional/intentional structure [Grosz & Sidner1986] and Rhetorical Structure Theory [Mann & Thompson1987]. Although many aspects of discourse analysis require such a model, I choose to cast expository text into a linear sequence of segments, both for computational simplicity and because such a structure is sufficient for the coarse-grained tasks of interest here.
Figure 1: Skorochod'ko's text structure types. Nodes correspond to sentences and edges between nodes indicate strong term overlap between the sentences.
skorochod72 suggests discovering a text's structure by dividing it up into sentences and seeing how much word overlap appears among the sentences. The overlap forms a kind of intra-structure; fully connected graphs might indicate dense discussions of a topic, while long spindly chains of connectivity might indicate a sequential account (see Figure 1). The central idea is that of defining the structure of a text as a function of the connectivity patterns of the terms that comprise it. This is in contrast with segmenting guided primarily by fine-grained discourse cues such as register change, focus shift, and cue words. From a computational viewpoint, deducing textual topic structure from lexical connectivity alone is appealing, both because it is easy to compute, and also because discourse cues are sometimes misleading with respect to the topic structure [Brown & Yule1983](§3).
The topology most of interest to this work is the final one in the diagram, the Piecewise Monolithic Structure, since it represents sequences of densely interrelated discussions linked together, one after another. This topology maps nicely onto that of viewing documents as a sequence of densely interrelated subtopical discussions, one following another. This assumption, as will be seen, is not always valid, but is nevertheless quite useful.
This theoretical stance bears a close resemblance to Chafe's notion of The Flow Model of discourse [Chafe1979], in description of which he writes (pp 179-180):
Our data ... suggest that as a speaker moves from focus to focus (or from thought to thought) there are certain points at which there may be a more or less radical change in space, time, character configuration, event structure, or, even, world. ... At points where all of these change in a maximal way, an episode boundary is strongly present. But often one or another will change considerably while others will change less radically, and all kinds of varied interactions between these several factors are possible.
Although Chafe's work concerns narrative text, the same kind of observation applies to expository text. The TextTiling algorithms are designed to recognize episode boundaries by determining where thematic components like those listed by Chafe change in a maximal way.
Figure 2: Distribution of selected terms from the Stargazer text, with a single digit frequency per sentence number (blanks indicate a frequency of zero).
Many researchers have studied the patterns of occurrence of characters, setting, time, and the other thematic factors that Chafe mentions, usually in the context of narrative. In contrast, I attempt to determine where a relatively large set of active themes changes simultaneously, regardless of the type of thematic factor. This is especially important in expository text in which the subject matter tends to structure the discourse more so than characters, setting, etc. For example, in the Stargazers text, a discussion of continental movement, shoreline acreage, and habitability gives way to a discussion of binary and unary star systems. This is not so much a change in setting or character as a change in subject matter. Therefore, to recognize where the subtopic changes occur, I make use of lexical cohesion relations [Halliday & Hasan1976] in a manner similar to that suggested by Skorochod'ko.
Morris and Hirst's pioneering work on computing discourse structure from lexical relations [Morris & Hirst1991], [Morris1988] is a precursor to the work reported on here. Influenced by halliday76b theory of lexical coherence, Morris developed an algorithm that finds chains of related terms via a comprehensive thesaurus (Roget's Fourth Edition). For example, the words residential and apartment both index the same thesaural category and can thus be considered to be in a coherence relation with one another. The chains are used to structure texts according to the attentional/intentional theory of discourse structure [Grosz & Sidner1986], and the extent of the chains correspond to the extent of a segment. The algorithm also incorporates the notion of ``chain returns'' - repetition of terms after a long hiatus - to close off an intention that spans over a digression.
Since the morris91 algorithm attempts to discover attentional/intentional structure, their goals are different than those of TextTiling. Specifically, the discourse structure they attempt to discover is hierarchical and more fine-grained than that discussed here. Thus their model is not set up to take advantage of the fact that multiple simultaneous chains might occur over the same intention. Furthermore, chains tend to overlap one another extensively in long texts. Figure 2 shows the distribution, by sentence number, of selected terms from the Stargazers text. The first two terms have fairly uniform distribution and so should not be expected to provide much information about the divisions of the discussion. The next two terms occur mainly at the beginning and the end of the text, while terms binary through planet have considerable overlap from sentences 58 to 78. There is a somewhat well-demarked cluster of terms between sentences 35 and 50, corresponding to the grouping together of paragraphs 10, 11, and 12 by human judges who have read the text.
From the diagram it is evident that simply looking for chains of repeated terms is not sufficient for determining subtopic breaks. Even combining terms that are closely related semantically into single chains is insufficient, since often several different themes are active in the same segment. For example, sentences 37 - 51 contain dense interaction among the terms move, continent, shoreline, time, species, and life, and all but the latter occur only in this region. However, it is the case that the interlinked terms of sentences 57 - 71 (space, star, binary, trinary, astronomer, orbit ) are closely related semantically, assuming the appropriate senses of the terms have been determined.
Many researchers (e.g., halliday76b, tannen89, walker91) have noted that term repetition is a strong cohesion indicator. I have found in this work that term repetition alone is a very useful indicator of subtopic structure, when analyzed in terms of multiple simultaneous information threads. This section describes two algorithms for discovering subtopic structure using term repetition as a lexical cohesion indicator.
The first method compares, for a given window size, each pair of adjacent blocks of text according to how similar they are lexically. This method assumes that the more similar two blocks of text are, the more likely it is that the current subtopic continues, and, conversely, if two adjacent blocks of text are dissimilar, this implies a change in subtopic flow. The second method, an extension of morris91 approach, keeps track of active chains of repeated terms, where membership in a chain is determined by location in the text. The method determines subtopic flow by recording where in the discourse the bulk of one set of chains ends and a new set of chains begins.
Figure 3: Judgments of seven readers on the Stargazer text. Internal numbers indicate location of gaps between paragraphs; x-axis indicates token-sequence gap number, y-axis indicates judge number, a break in a horizontal line indicates a judge-specified segment break.
Figure: Results of the block similarity algorithm on the Stargazer text. Internal numbers indicate paragraph numbers, x-axis indicates token-sequence gap number, y-axis indicates similarity between blocks centered at the corresponding token-sequence gap. Vertical lines indicate boundaries chosen by the algorithm; for example, the leftmost vertical line represents a boundary after paragraph 3. Note how these align with the boundary gaps of Figure 3 above.
The core algorithm has three main parts:
Tokenization refers to the division of the input text into individual lexical units. For both versions of the algorithm, the text is subdivided into psuedosentences of a pre-defined size w (a parameter of the algorithm) rather than actual syntactically-determined sentences, thus circumventing normalization problems. For the purposes of the rest of the discussion these groupings of tokens will be referred to as token-sequences. In practice, setting w to 20 tokens per token-sequence works best for many texts. The morphologically-analyzed token is stored in a table along with a record of the token-sequence number it occurred in, and how frequently it appeared in the token-sequence. A record is also kept of the locations of the paragraph breaks within the text. Closed-class and other very frequent words are eliminated from the analysis.
After tokenization, the next step is the comparison of adjacent pairs of blocks of token-sequences for overall lexical similarity. Another important parameter for the algorithm is the blocksize: the number of token-sequences that are grouped together into a block to be compared against an adjacent group of token-sequences. This value, labeled k, varies slightly from text to text; as a heuristic it is the average paragraph length (in token-sequences). In practice, a value of k=6 works well for many texts. Actual paragraphs are not used because their lengths can be highly irregular, leading to unbalanced comparisons.
Similarity values are computed for every token-sequence gap number; that is, a score is assigned to token-sequence gap i corresponding to how similar the token-sequences from token-sequence i-k through i are to the token-sequences from i+1 to i+k+1. Note that this moving window approach means that each token-sequence appears in k * 2 similarity computations.
Similarity between blocks is calculated by a cosine measure: given two text blocks and , each with k token-sequences,
where t ranges over all the terms that have been registered during the tokenization step, and is the weight assigned to term t in block . In this version of the algorithm, the weights on the terms are simply their frequency within the block. Thus if the similarity score between two blocks is high, then the blocks have many terms in common. This formula yields a score between 0 and 1, inclusive.
These scores can be plotted, token-sequence number against similarity score. However, since similarity is measured between blocks and , where spans token-sequences i-k through i and spans i+1 to i+k+1, the measurement's x-axis coordinate falls between token-sequences i and i+1. Rather than plotting a token-sequence number on the x-axis, we plot token-sequence gap number i. The plot is smoothed with average smoothing; in practice one round of average smoothing with a window size of three works best for most texts.
Boundaries are determined by changes in the sequence of similarity scores. The token-sequence gap numbers are ordered according to how steeply the slopes of the plot are to either side of the token-sequence gap, rather than by their absolute similarity score. For a given token-sequence gap i, the algorithm looks at the scores of the token-sequence gaps to the left of i as long are their values are increasing. When the values to the left peak out, the difference between the score at the peak and the score at i is recorded. The same procedure takes place with the token-sequence gaps to the right of i; their scores are examined as long as they continue to rise. The relative height of the peak to the right of i is added to the relative height of the peak to the left. (A gap occurring at a peak will have a score of zero since neither of its neighbors is higher than it.) These new scores, called depth scores, corresponding to how sharp a change occurs on both sides of the token-sequence gap, are then sorted. Segment boundaries are assigned to the token-sequence gaps with the largest corresponding scores, adjusted as necessary to correspond to true paragraph breaks. A proviso check is done that prevents assignment of very close adjacent segment boundaries. Currently there must be at least three intervening token-sequences between boundaries. This helps control for the fact that many texts have spurious header information and single-sentence paragraphs.
The algorithm must determine how many segments to assigned to a document, since every paragraph is a potential segment boundary. Any attempt to make an absolute cutoff is problematic since there would need to be some correspondence to the document style and length. A cutoff based on a particular valley depth is similarly problematic.
I have devised a method for determining the number of boundaries to assign that scales with the size of the document and is sensitive to the patterns of similarity scores that it produces: the cutoff is a function of the average and standard deviations of the depth scores for the text under analysis. Currently a boundary is drawn only if the depth score exceeds .
One way to evaluate these segmentation algorithms is to compare against judgments made by human readers, another is to compare the algorithms against texts pre-marked by authors, and a third way is to see how well the results improve a computational task. This section compares the algorithm against reader judgments, since author markups are fallible and are usually applied to text types that this algorithm is not designed for, and hearst94 shows how to use TextTiles in a task (although it does not show whether or not the results of the algorithms used here are better than some other algorithm with similar goals).
Judgments were obtained from seven readers for each of thirteen magazine articles which satisfied the length criteria (between 1800 and 2500 words) and which contained little structural demarkation. The judges were asked simply to mark the paragraph boundaries at which the topic changed; they were not given more explicit instructions about the granularity of the segmentation.
Figure 3 shows the boundaries marked by seven judges on the Stargazers text. This format helps illustrate the general trends made by the judges and also helps show where and how often they disagree. For instance, all but one judge marked a boundary between paragraphs 2 and 3. The dissenting judge did mark a boundary after 3, as did two of the concurring judges. The next three major boundaries occur after paragraphs 5, 9, 12, and 13. There is some contention in the later paragraphs; three readers marked both 16 and 18, two marked 18 alone, and two marked 17 alone. The outline in the Introduction gives an idea of what each segment is about.
passonneau93 discuss at length considerations about evaluating segmentation algorithms according to reader judgment information. As Figure 3 shows, agreement among judges is imperfect, but trends can be discerned. In passonneau93 data, if 4 or more out of 7 judges mark a boundary, the segmentation is found to be significant using a variation of the Q-test [Cochran1950]. My data showed similar results. However, it isn't clear how useful this significance information is, since a simple majority does not provide overwhelming proof about the objective reality of the subtopic break. Since readers often disagree about where to draw a boundary marking for a topic shift, one can only use the general trends as a basis from which to compare different algorithms. Since the goals of TextTiling are better served by algorithms that produce more rather than fewer boundaries, I set the cutoff for ``true'' boundaries to three rather than four judges per paragraph. The remaining gaps are considered nonboundaries.
Figure 4 shows a plot of the results of applying the block comparison algorithm to the Stargazer text. When the lowermost portion of a valley is not located at a paragraph gap, the judgment is moved to the nearest paragraph gap. For the most part, the regions of strong similarity correspond to the regions of strong agreement among the readers. (The results for this text were fifth highest out of the 13 test texts.) Note however, that the similarity information around paragraph 12 is weak. This paragraph briefly summarizes the contents of the previous three paragraphs; much of the terminology that occurred in all of them reappears in this one location (in the spirit of a grosz86 ``pop'' operation). Thus it displays low similarity both to itself and to its neighbors. This is an example of a breakdown caused by the assumptions about the subtopic structure. It is possible that an additional pass through the text could be used to find structure of this kind.
The final paragraph is a summary of the entire text; the algorithm recognizes the change in terminology from the preceding paragraphs and marks a boundary; only two of the readers chose to differentiate the summary; for this reason the algorithm is judged to have made an error even though this sectioning decision is reasonable. This illustrates the inherent fallibility of testing against reader judgments, although in part this is because the judges were given loose constraints.
Following the advice of gale92, I compare the algorithm against both upper and lower bounds. The upper bound in this case is the reader judgment data. The lower bound is a baseline algorithm that is a simple, reasonable approach to the problem that can be automated. A simple way to segment the texts is to place boundaries randomly in the document, constraining the number of boundaries to equal that of the average number of paragraph gaps assigned by judges. In the test data, boundaries are placed in about 41% of the paragraph gaps. A program was written that places a boundary at each potential gap 41% of the time (using a random number generator), and run 10,000 times for each text, and the average of the scores of these runs was found. These scores appear in Table 1 (results at 33% are also shown for comparison purposes).
The algorithms are evaluated according to how many true boundaries they select out of the total selected (precision) and how many true boundaries are found out of the total possible (recall) [Salton1988]. The recall measure implicitly signals the number of missed boundaries (false negatives, or deletion errors); the number of false positives, or insertion errors, is indicated explicitly.
In many cases the algorithms are almost correct but off by one paragraph, especially in the texts that the algorithm performs poorly on. When the block similarity algorithm is allowed to be off by one paragraph, there is dramatic improvement in the scores for the texts that lower part of Table 2, yielding an overall precision of 83% and recall of 78%. As in Figure 4, it is often the case that where the algorithm is incorrect, e.g., paragraph gap 11, the overall blocking is very close to what the judges intended.
Table 1: Precision and Recall values for 13 test texts.
Table 1 shows that both the blocking algorithm and the chaining algorithm are sandwiched between the upper and lower bounds. Table 2 shows some of these results in more detail. The block similarity algorithm seems to work slightly better than the chaining algorithm, although the difference may not prove significant over the long run. Furthermore, in both versions of the algorithm, changes to the parameters of the algorithm perturbs the resulting boundary markings. This is an undesirable property and perhaps could be remedied with some kind of information-theoretic formulation of the problem.
Table 2: Scores by text, showing precision and recall. (C) indicates the number of correctly placed boundaries, (I) indicates the number of inserted boundaries. The number of deleted boundaries can be determined by subtracting (C) from Total Possible.
This paper has described algorithms for the segmentation of expository texts into discourse units that reflect the subtopic structure of expository text. I have introduced the notion of the recognition of multiple simultaneous themes, which bears some resemblance to Chafe's Flow Model of discourse and Skorochod'ko's text structure types. The algorithms are fully implemented: term repetition alone, without use of thesaural relations, knowledge bases, or inference mechanisms, works well for many of the experimental texts. The structure it obtains is coarse-grained but generally reflects human judgment data.
Earlier work [Hearst1993] incorporated thesaural information into the algorithms; surprisingly the latest experiments find that this information degrades the performance. This could very well be due to problems with the algorithm used. A simple algorithm that just posits relations among terms that are a small distance apart according to WordNet [Miller et al.1990] or Roget's 1911 thesaurus (from Project Gutenberg), modeled after Morris and Hirst's heuristics, might work better. Therefore I do not feel the issue is closed, and instead consider successful grouping of related words as future work. As another possible alternative kozima93 has suggested using a (computationally expensive) semantic similarity metric to find similarity among terms within a small window of text (5 to 7 words). This work does not incorporate the notion of multiple simultaneous themes but instead just tries to find breaks in semantic similarity among a small number of terms. A good strategy may be to substitute this kind of similarity information for term repetition in algorithms like those described here. Another possibility would be to use semantic similarity information as computed in schuetze93, resnik93, or dagan93.
The use of discourse cues for detection of segment boundaries and other discourse purposes has been extensively researched, although predominantly on spoken text (see hirschberg93 for a summary of six research groups' treatments of 64 cue words). It is possible that incorporation of such information may provide a relatively simple way improve the cases where the algorithm is off by one paragraph.
This paper has benefited from the comments of Graeme Hirst, Jan Pedersen, Penni Sibun, and Jeff Siskind. I would like to thank Anne Fontaine for her interest and help in the early stages of this work, and Robert Wilensky for supporting this line of research. This work was sponsored in part by the Advanced Research Projects Agency under Grant No. MDA972-92-J-1029 with the Corporation for National Research Initiatives (CNRI), and by the Xerox Palo Alto Research Center.