Marti A. Hearst

Professor

University of California, Berkeley

Research: Scatter/Gather

    Preamble: The Scatter/Gather project took place at Xerox PARC in the 1990s as a team project. My contribution focused on the use of S/G in search interfaces. The text below was written between 1994 and 1997.

    When a user enters a query into a search engine, the system often brings back many different pages. What should be done with all of these search results? How can we help the user understand what the system brought back and how it is related to the query? One solution we have explored is to show the user how the terms in the query are related to the words in the retrieved documents. This solution is embodied in the TileBars interface.

    Another strategy is to organize the documents into meaningful groups.

    There are many different ways we can show how the documents in a set are related to one another. One way is to group together all documents written by the same author, or all documents written in the same year, or published by the same publisher. We can group according to subject matter as well. Libraries organize some of their information this way, using classification systems like the Dewey Decimal system. For example, in the library, books about European history are placed in one location, and books on computer science in another.

    This kind of system is useful, but only goes so far. One problem is that if there are, say, 2000 books on European History, they need to be subdivided into more fine-grained categories in order to be manageable for browsing. However, the topic of European History can be subdivided in many ways, for instance, by nation, by economic changes, by political movements, by cultural characteristics, and so on. And of course, all of these topics are inter-related.

    It is necessary in the library to place books under one subject heading most of the time, because books are physical objects and can appear in only one place at a time. The computer gives us the added power of grouping books and other documents in many different ways.

    Another problem with assigning documents to single categories within a hierarchy (as seen in, for example, Yahoo) is that most documents discuss several different topics simultaneously. Although real-life objects can often be assigned one place in a taxonomy (a truck is a kind of vehicle) textual documents are not so simply classified. Text consists of abstract discussions of ideas and their inter-relationships. It is a rare document that is only about trucks; instead, a document might discuss recreational vehicles, or the manufacturing of recreational vehicles, or for that matter the trends in manufacturing of American recreational vehicles in Mexico before vs. after the NAFTA agreement. The tendency in building taxonomic hierarchies is to create ever-more-specific categories to handle cases like this. A better solution is to describe documents by a set of categories, as well as attributes (such as source, date, genre, and author), and provide good interfaces for manipulating these labels.

    As an alternative, the Scatter/Gather interface uses text clustering as a way to group documents according to the overall similarities in their content. Scatter/Gather is so named because it allows the user to scatter documents into clusters, or groups, then gather a subset of these groups and re-scatter them to form new groups.

    Each cluster in Scatter/Gather is represented by a list of topical terms, that is, a list of words that attempt to give the user the gist of what the documents in the cluster are about. The user can also look at the titles of the documents in each group. The documents in the cluster can have other representations as well, such as summaries, or TileBars.

    If a cluster still has too many documents, the user can re-cluster the documents in the cluster; that is, re-group that subset of documents into still smaller groups. This re-grouping process tends to change the kinds of themes of the clusters, because the documents in a subcollection discuss a different set of topics than all the documents in the larger collection.
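
    The scatter/re-scatter loop can be sketched briefly. The code below is an illustrative approximation only, not the actual Scatter/Gather implementation: it uses an off-the-shelf k-means clusterer over tf-idf vectors in place of the faster, specialized clustering algorithms described in the papers listed at the end of this page, and it approximates a cluster's topical terms by the top-weighted terms of the cluster centroid. The function name scatter and the variable retrieved_docs are invented for this example.

        # Illustrative sketch only: k-means over tf-idf vectors stands in for the
        # specialized clustering algorithms of the real system, and a cluster's
        # "topical terms" are approximated by the top-weighted terms of its centroid.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import KMeans

        def scatter(docs, k, n_terms=8):
            """Group docs into k clusters; return (member lists, topical-term lists)."""
            vectorizer = TfidfVectorizer(stop_words="english")
            X = vectorizer.fit_transform(docs)          # one tf-idf vector per document
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
            terms = vectorizer.get_feature_names_out()
            clusters, summaries = [], []
            for c in range(k):
                members = [d for d, label in zip(docs, km.labels_) if label == c]
                top = km.cluster_centers_[c].argsort()[::-1][:n_terms]
                clusters.append(members)
                summaries.append([terms[i] for i in top])
            return clusters, summaries

        # Scatter the retrieved documents into five groups ...
        # clusters, summaries = scatter(retrieved_docs, k=5)
        # ... then gather the interesting groups and re-scatter them into smaller ones:
        # subset = clusters[1] + clusters[3]
        # subclusters, subsummaries = scatter(subset, k=3)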

    The best way to understand how Scatter/Gather works is to look at a few examples.

    A Scatter/Gather Example

    Here we demonstrate the use of Scatter/Gather on a collection of encyclopedia articles. Our query is very simple:

    Retrieve the top 250 documents that contain the word star.

    Here we show that Scatter/Gather text clustering does a reasonably good job at organizing the documents into meaningful themes or topics.

    We ask Scatter/Gather to place the 250 documents into 5 groups. Here are the results. (Bear in mind that encyclopedia articles are well written and uniform in format. The next example shows the results of a more complicated query on a more unruly text collection.)

    Shown here are the clusters' sizes (how many documents they contain), a list of topical terms, and a list of document titles. One can see from the topical terms of Cluster 1 that this cluster contains documents that involve stars as symbols, as in military rank and patriotic songs.

    Cluster 2 has 68 documents that appear mainly to be about movie and tv stars.

    Cluster 3 contains 97 documents that have to do with aspects of astrophysics.

    Cluster 4 contains 67 documents also about astronomy and astrophysics. This cluster contains many articles about people who are astronomers (this is apparent when the list is scrolled down).

    Cluster 5 contains all the articles that discuss animals or plants, and that happen to contain the word star, for example, starfish.

    If we ask Scatter/Gather to re-cluster the 68 documents that appear in Cluster 2, the one that discusses movie and tv stars, and place the results into three clusters, we see the following clusters:

    This re-clustering reveals that in actuality this cluster had more kinds of documents than we originally thought, based on the topical terms. These three clusters can be rather neatly summarized as containing articles about (Cluster 1) people who are sports stars, (Cluster 2) stars of film, tv, and theatre, and (Cluster 3) musicians.

    Now if we back up a step and re-cluster Cluster 3 from the original set, placing the results into four clusters, we see the following:

    The contents of these four clusters can be glossed as general astrophysics, galaxies and stars, constellations, and a cluster of leftover, or outlying documents.

    This example suggests the potential power of the system for automatically grouping documents according to themes. It also shows some issues that remain to be addressed. First, we need to determine automatically what the best number of clusters is at each phase. Currently we have the user make the decision of how many clusters to show for each document subcollection. We are working on how to make this choice automatically, based on the characteristics of the subcollection. Second, sometimes the summary is misleading or incomplete in terms of what documents are to be found in the cluster. We saw this with the cluster about film and tv stars -- it also contained documents about sports and music stars, although these were in the minority. We are working on determining how to indicate to the user when there are hidden topic areas in the cluster.

    Another Example

    Here we demonstrate the use of Scatter/Gather on the TIPSTER collection of over 1 million newswire, newspaper, magazine, and government articles, dating mainly from the late 1980s. We also make use of one of the TREC queries, which have been evaluated by a set of judges who have determined which documents are relevant for each query. In this example, the task is to find all documents that discuss the following (abbreviated) topic:

    Topic 87: Criminal Actions Against Officers of Failed Financial Institutions

    We formulated a query containing the terms

    bank financial institution failed criminal officer indictment

    and instructed the system to retrieve the 500 top-ranked documents according to a vector space weighting (see below). Out of these 500 retrieved documents, only 21 had been judged relevant to the query by the TREC judges (some may not have been judged at all, but for the purposes of this example, those with no judgement are simply considered to be not relevant). These documents were not ranked especially high by the vector space measure: none of the documents judged relevant appeared in the top 10, only one appeared in the top 20, and only four appeared in the top 40.

    A tool that can guide the user towards the relevant subgroups would indeed be useful. Here we show that the Scatter/Gather tool can be effective in this way. The system is instructed to gather the 500 documents into five clusters; below are shown the resulting sizes and topical terms:

    Cluster 4 stands out for the purposes of the query in that it contains terms pertaining to fraud, investigation, lawyers, and courts. Note that in a general corpus these terms might not be descriptive for this query since the user would assume the documents were about legal issues in general. However, since we know the system has retrieved documents that also pertain to financial institutions, we can assume that the legal terms occur in the context of financial documents.

    The topical terms for Cluster 5 are less compelling, for they have only one term corresponding to a crime and seem to clearly indicate documents discussing the scandal involving the leader of the Philippines in the late 1980s. The topical terms of Cluster 0 are very general (and there are only four documents in the cluster, which can be quickly scanned). It appears that Cluster 1 contains very general documents that do not fit into any of the other clusters particularly well, whereas Cluster 5 contains documents that relate to a very specific allegation of fraud.

    Cluster 2 is also compelling in that its summary contains many financial terms; however, it is less promising than Cluster 4 in that it seems more related to assets and risk assessment than to criminal charges and failed banks. Finally, Cluster 3 seems related most strongly to rules and regulations, rather than indictments and fraud. Note again that this cluster, if taken out of context, might seem to refer to government regulations in general; however, since, like Cluster 4, it was generated as the result of a query on financial terms, it most likely contains documents discussing rules and regulations on financial matters. A re-scattering of the cluster confirmed this suspicion.

    Based on this assessment, Cluster 4 looks most promising. If the user re-scatters (or re-clusters) it, five new clusters are produced:

    The user might choose the first, third, and fourth clusters since their topical terms all seem to pertain to the topic of interest. Clusters 3 and 4 are especially compelling since they contain terms pertaining both to finance and to criminal proceedings. Cluster 3 has more terms about conviction, but Cluster 4 has more terms pertaining to failure and to the kinds of financial institutions that the user may have known to have failed, namely S&Ls and thrifts. As it turns out, the clusters' contents reflect these observations: Cluster 3 contains mainly articles about indictments pertaining to financial fraud involving securities and stocks, but not failed banks. The user can view the contents of a cluster in ranked order, according to the score generated by the similarity search, or can view the documents with some other search tool. Based on the topical terms, the most promising-looking clusters are 1, 3, and 4.

    It turns out that Cluster 1 has one relevant document out of four. The third cluster, Cluster 3, has one relevant document out of 28. Highly ranked documents in this group include indictments for other crimes, some of which are financial in nature:

    Cluster 4 has eleven relevant documents out of 29:

    The second cluster also contains two relevant documents. Most other documents in this cluster discuss indictments for money laundering along with one article involving Noriega and another on a teen scandal in San Francisco. The last cluster has no relevant documents although it has several that discuss the BCCI.

    So it turns out that the top-level Cluster 4 contains 15 of the 21 relevant documents. The remaining 6 relevant documents are found exclusively in Cluster 2. When this cluster, which contains 187 documents, is scattered:

    all six relevant documents appear in one cluster, Cluster 3, of size 88:

    This example shows the tendency of relevant documents to clump into a few clusters. If users can determine which clusters are of the most interest, they will then look through a higher density of relevant documents than if the documents were simply shown in ranked order.

    (This hypothesis has been supported with some experimental work; see the paper Re-examining the Cluster Hypothesis for more details.) The collection used here is the TREC/Tipster collection, provided by NIST.

    In the vector space weighting approach, championed by Gerry Salton, Chris Buckley, and other members of the SMART group at Cornell, documents and queries are represented as weighted vectors, where each word in the document corresponds to one position in the vector. Documents are ranked according to a normalized inner product between the query and the document's representation. Thus, not all the query terms need be present in the document in order for the document to be retrieved, and a document that has many instances of a highly weighted word might be ranked higher than a document with a few instances of many of the words.
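
    As a much-simplified sketch of this ranking scheme (using a generic tf-idf weighting rather than the exact SMART weights, and with made-up toy documents and query), the snippet below scores documents by the cosine, that is, the normalized inner product, between query and document vectors:

        # Simplified sketch of vector-space ranking: generic tf-idf weights and a
        # cosine (normalized inner product) score, not the exact SMART weighting.
        import math
        from collections import Counter

        def tfidf_vectors(docs):
            """Build one sparse {term: weight} vector per tokenized document."""
            n = len(docs)
            df = Counter(term for doc in docs for term in set(doc))
            vecs = [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
                    for doc in docs]
            return vecs, df, n

        def cosine(q, d):
            """Normalized inner product between two sparse term-weight vectors."""
            dot = sum(w * d.get(t, 0.0) for t, w in q.items())
            norm = (math.sqrt(sum(w * w for w in q.values())) *
                    math.sqrt(sum(w * w for w in d.values())))
            return dot / norm if norm else 0.0

        docs = [["bank", "fraud", "indictment"],
                ["star", "galaxy", "telescope"],
                ["bank", "failed", "officer"]]
        doc_vecs, df, n = tfidf_vectors(docs)
        query = ["bank", "criminal", "indictment"]   # "criminal" appears in no document
        q_vec = {t: c * math.log(n / df[t]) for t, c in Counter(query).items() if t in df}
        ranking = sorted(range(len(docs)),
                         key=lambda i: cosine(q_vec, doc_vecs[i]), reverse=True)
        # Documents are still ranked even though not every query term occurs in them.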

    Papers About Scatter/Gather

    Several technical papers have been published about Scatter/Gather.

    • Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections

      The first paper. Describes how to cluster an entire document collection hierarchically, and allow the user to gather clusters at different levels and rescatter them. Implementation details are provided.

      Douglass Cutting, David Karger, Jan Pedersen, and John W. Tukey. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, Proceedings of the 15th Annual International ACM/SIGIR Conference, Copenhagen, 1992.

    • Constant Interaction-Time Scatter/Gather Browsing of Large Document Collections

      Describes performance improvements for the system presented in the first paper.

      Douglass Cutting, David Karger, and Jan Pedersen. Constant Interaction-Time Scatter/Gather Browsing of Large Document Collections. Proceedings of the 16th Annual International ACM/SIGIR Conference, Pittsburgh, PA, 1993.

    • Scatter/Gather Browsing Communicates the Topic Structure of a Very Large Text Collection

      Describes an evaluation of the effectiveness of the system for browsing the contents of a large collection.

      Peter Pirolli, Patricia Schank, Marti A. Hearst, and Christine Diehl, Scatter/Gather Browsing Communicates the Topic Structure of a Very Large Text Collection, Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), May 1996.

    • Scatter/Gather as a Tool for the Navigation of Retrieval Results

      The first paper discussing the idea of using Scatter/Gather on retrieval results, that is, documents retrieved as the result of a query. Anecdotes only.

      Marti A. Hearst, David R. Karger, and Jan O. Pedersen, Scatter/Gather as a Tool for the Navigation of Retrieval Results, Proceedings of the 1995 AAAI Fall Symposium on Knowledge Navigation.

    • Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results

      Presents the discovery that when clustering is applied to documents retrieved as the result of a query, most of the relevant documents tend to appear in one of the clusters. This implies that picking the best cluster will find most of the relevant documents. This observation was verified with experimentation.

      Marti Hearst and Jan Pedersen, Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results, Proceedings of the 19th Annual International ACM/SIGIR Conference, Zurich, August 1996.

    • Four TREC-4 Tracks: the Xerox Site Report

      Anecdotal description of users' responses to Scatter/Gather as part of an interface that uses TileBars as well.

      Marti Hearst, Jan Pedersen, Peter Pirolli, Hinrich Schuetze, Gregory Grefenstette, and David Hull, Four TREC-4 Tracks: the Xerox Site Report, Proceedings of the Fourth Text REtrieval Conference (TREC-4), Nov 1-3, Arlington, VA, 1996.

    • Videotape: Revealing Collection Structure through Information Access Interfaces

      A videotape that shows both Scatter/Gather and TileBars.

      Marti Hearst and Jan Pedersen, Revealing Collection Structure through Information Access Interfaces, Video Track, Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Montreal, Canada, Volume II, 2047-2048, 1995.

    Related Work by Others

    • Tailoring a Retrieval System for Naive Users

      Describes an experiment comparing several systems with different approaches to showing clusters. Finds that an approach that shows document titles, as Scatter/Gather does, works better for naive users than systems that show clusters graphically.

      Adrienne J. Kleiboemer, Manette B. Lazear, and Jan O. Pedersen, Tailoring a Retrieval System for Naive Users, Proceedings of the Fifth Annual Symposium on Document Analysis and Information Retrieval (SDAIR), Las Vegas, NV, 1996.