Distinguishing between Web Data Mining and Information Access

Presentation for the Panel on Web Data Mining KDD 97
Position Statement by Marti Hearst
August 16, Newport Beach, CA

What does it mean to mine data from the Web, as opposed to from other sources of information? The Web contains a mix of many different data types, and so in a sense subsumes text data mining, database data mining, image mining, and so on. The Web also contains data types that were not previously available on a large scale, including hyperlinks and massive amounts of (indirect) user usage information. Spanning all of these data types is the dimension of time, since data on the Web changes over time. Finally, there is data that is generated dynamically, in response to user input and programmatic scripts.

I think the most interesting results will come from new combinations of these different data types, applied to new goals. For example, Jon Kleinberg of Cornell and Larry Page of Stanford have created algorithms to find pages that are "authorities" on particular topics, e.g., the authority pages for the Java programming language or the Jaguar car. Kleinberg's algorithms rely primarily on link information (after an initial keyword search). A problem with this approach is that it does not necessarily find authority pages for less popular topics that center on the same words, e.g., the feline jaguar or the island of Java. If text clustering were applied first to find the main topics among a large set of documents retrieved for the given keyword, and the link algorithm were applied afterwards to find the authority pages within each sense of the keyword, results could most likely be improved.
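
The cluster-then-link idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not Kleinberg's or Page's actual algorithm: the greedy Jaccard clustering, the simplified hub/authority iteration, the 0.2 similarity threshold, the function names, and the toy pages and links are all hypothetical choices made for the example.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_text(pages, threshold=0.2):
    """Greedy single-pass clustering (an assumed stand-in for real text
    clustering): a page joins the first cluster whose seed page is similar
    enough, otherwise it starts a new cluster."""
    clusters = []  # list of (seed_tokens, [page_ids])
    for page_id, text in pages.items():
        tokens = set(text.lower().split())
        for seed_tokens, members in clusters:
            if jaccard(tokens, seed_tokens) >= threshold:
                members.append(page_id)
                break
        else:
            clusters.append((tokens, [page_id]))
    return [members for _, members in clusters]


def authority_scores(members, links, iterations=20):
    """Simplified hub/authority iteration, restricted to links whose two
    endpoints both lie inside the given cluster."""
    members = set(members)
    edges = [(s, t) for s, t in links if s in members and t in members]
    hub = {p: 1.0 for p in members}
    auth = {p: 1.0 for p in members}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores pointing at it.
        new_auth = {p: 0.0 for p in members}
        for s, t in edges:
            new_auth[t] += hub[s]
        # A page's hub score is the sum of the authorities it points to.
        new_hub = {p: 0.0 for p in members}
        for s, t in edges:
            new_hub[s] += new_auth[t]
        # Normalize so the scores do not grow without bound.
        a_norm = sum(v * v for v in new_auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth


def authorities_per_sense(pages, links, threshold=0.2):
    """Cluster the pages by text first, then pick the top authority
    within each cluster (one cluster per sense of the keyword)."""
    result = []
    for members in cluster_by_text(pages, threshold):
        scores = authority_scores(members, links)
        best = max(sorted(members), key=lambda p: scores[p])
        result.append((sorted(members), best))
    return result


# Toy result set for the keyword "java" (all page names are made up).
pages = {
    "java-lang": "java programming language reference",
    "java-tut": "java programming language tutorial",
    "java-faq": "java programming language faq",
    "java-isle": "java island indonesia travel",
    "java-map": "java island indonesia map",
}
links = [
    ("java-tut", "java-lang"),
    ("java-faq", "java-lang"),
    ("java-map", "java-isle"),
    ("java-tut", "java-isle"),  # crosses clusters, so it is ignored
]

for members, best in authorities_per_sense(pages, links):
    print(members, "->", best)
```

On this toy data the less popular sense (the island) gets its own authority page instead of being crowded out by the programming-language pages. A real system would need a far more robust clustering step and a link graph gathered from the Web itself.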

Any discussion of data mining from the Web requires a discussion of issues of scale. If an operation is performed across all the data on the Web, then it must scale to the size of the Web. If it is not performed over the entire Web, is it Web data mining, or just text data mining, database data mining, etc.? One could argue that extracting information from Web pages, even from a very small subset of all Web pages, is an instance of Web data mining, because the formats and types of data that appear on the Web differ in general from other kinds of data, as discussed above.

One last point that is near and dear to me, as an information access researcher, is that it is important to differentiate between text data mining and information access. Text data mining involves extracting "nuggets" and/or overall patterns from a collection of textual information, independent of a user's information need, whereas information access is the process of helping users find, create, use, re-use, and understand information to satisfy an information need. In other words, data mining is opportunistic, whereas information access is goal-driven. For some purposes this difference may not matter, but for others it probably does. That said, I would like to see future information access systems that include, as one toolset among many, tools that support a kind of exploratory data analysis over a subcollection of documents.

Table of Contents

Distinguishing between Web Data Mining and Information Access Presentation for the Panel on Web Data Mining KDD 97 August 16, Newport Beach, CA

Definitions

Definitions

Definitions

Web DM is not IA

DM:KDD as TA:IA

KDD vs. IA

DM Application Types (CACM 39 (11) Special Issue)

Unknown vs. Hard to Find

Real Web DM

Text Analysis as Data Mining

DM vs. Text Analysis

Using Results of DM

Combining Data Types for Novel Tasks

A New Take on Information Access

Author: hearst

Email: hearst@sims.berkeley.edu

Home Page: www.sims.berkeley.edu/~hearst

Other information:
Presentation for Web Data Mining Panel at KDD '97