What does it mean to mine data from the Web, as opposed to from other sources of information? The Web contains a mix of many different data types, and so in a sense subsumes text data mining, database data mining, image mining, and so on. The Web also contains data types not previously available on a large scale, including hyperlinks and massive amounts of (indirect) usage information. Spanning all of these data types is the dimension of time, since data on the Web changes over time. Finally, there is data that is generated dynamically, in response to user input and programmatic scripts.
I think the most interesting results will come from new combinations of these different data types to achieve novel goals. For example, Jon Kleinberg of Cornell and Larry Page of Stanford have created algorithms to find pages that are ``authorities'' on particular topics, e.g., the authority pages for the Java programming language or the Jaguar car. Kleinberg's algorithms rely primarily on link information (after an initial keyword search). A problem with this approach is that it does not necessarily find authority pages for less popular topics that center on the same words, e.g., the feline jaguar or the island of Java. If text clustering were applied first to find the main topics among a large set of documents for the given keyword, and the link algorithm were applied afterwards to find the authority pages within each sense of the keyword, results could most likely be improved.
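The cluster-then-link idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Kleinberg's or Page's actual implementation: the pages, their text, and the links are made-up toy data, the function names are hypothetical, and a greedy word-overlap grouping stands in for a real text clustering method. The link step is a plain hub/authority iteration in the style of Kleinberg's algorithm, run once per cluster.

```python
import math

# Hypothetical toy data: page id -> page text, plus (source, target) hyperlinks.
PAGES = {
    "sun_java": "java programming language compiler code",
    "java_tutorial": "java programming tutorial code examples",
    "java_faq": "java programming language faq code",
    "java_travel": "java island indonesia travel",
    "java_island_map": "java island indonesia map volcano",
    "java_tourism": "java island travel tourism indonesia",
}
LINKS = [
    ("java_tutorial", "sun_java"),
    ("java_faq", "sun_java"),
    ("java_faq", "java_tutorial"),
    ("java_travel", "java_island_map"),
    ("java_tourism", "java_island_map"),
    ("java_tourism", "sun_java"),  # a link that crosses the two senses
]


def cluster_by_text(pages, query_word, threshold=0.2):
    """Greedy one-pass grouping by Jaccard word overlap.

    A deliberately simple stand-in for a real text clustering algorithm.
    The query word is dropped because every retrieved page contains it.
    """
    clusters = []  # each entry: (member ids, vocabulary seen so far)
    for page_id, text in pages.items():
        words = set(text.lower().split()) - {query_word}
        best, best_sim = None, 0.0
        for cluster in clusters:
            vocab = cluster[1]
            union = words | vocab
            sim = len(words & vocab) / len(union) if union else 0.0
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best[0].append(page_id)
            best[1].update(words)
        else:
            clusters.append(([page_id], set(words)))
    return [members for members, _ in clusters]


def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def hits(nodes, links, iterations=50):
    """Hub/authority iteration restricted to links among `nodes`."""
    nodes = list(nodes)
    node_set = set(nodes)
    edges = [(s, t) for s, t in links if s in node_set and t in node_set]
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores pointing at it.
        new_auth = {n: 0.0 for n in nodes}
        for s, t in edges:
            new_auth[t] += hub[s]
        # A page's hub score is the sum of the authorities it points at.
        new_hub = {n: 0.0 for n in nodes}
        for s, t in edges:
            new_hub[s] += new_auth[t]
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub


def authorities_by_sense(pages, links, query_word):
    """Cluster the pages by text first, then rank by links within each cluster."""
    results = []
    for members in cluster_by_text(pages, query_word):
        auth, _ = hits(members, links)
        results.append((members, max(auth, key=auth.get)))
    return results


if __name__ == "__main__":
    for members, top in authorities_by_sense(PAGES, LINKS, "java"):
        print(members, "-> top authority:", top)
```

On this toy graph, running the link step over all six pages at once ranks only the programming-language page first, while running it per cluster also surfaces a top page for the island sense. The similarity threshold of 0.2 is an arbitrary choice for the example; a real collection would need a proper clustering method and tuning.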
Any discussion of data mining from the Web requires a discussion of issues of scale. If an operation is performed across all the data on the Web, then it must scale to the size of the Web. If it is not performed over the entire Web, is it Web data mining or just text data mining, database data mining, etc.? One could argue that extracting information from Web pages, even a very small subset of all Web pages, is an instance of Web data mining, because the formats and types of data that appear on the Web generally differ from those of other kinds of data, as discussed above.
One last point that is near and dear to me, as an information access
researcher, is that it is important to differentiate between text data
mining and information access. Text data mining involves extracting
``nuggets'' and/or overall patterns from a collection of textual
information, independent of a user's information need, whereas
information access is the process of helping users find, create, use,
re-use, and understand information to satisfy an information need. In
other words, data mining is opportunistic, whereas information access
is goal-driven. For some purposes this difference may not matter, but
for others it probably does. However, I would like to see future
information access systems that include, as one toolset among many,
tools that support a kind of exploratory data analysis over a
subcollection of documents.
Home Page: www.sims.berkeley.edu/~hearst