Experiments in Topic Distillation

Soumen Chakrabarti, Byron E. Dom, David Gibson, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins

IBM Research - Almaden, 650 Harry Road, San Jose, CA 95120.


We consider the problem of topic distillation: given a broad topic, distill a small number of high-quality web pages that are most representative of the topic. While the web grows exponentially with time, the amount of information that the human end-user can digest remains roughly constant. Our goal is thus not to index, search or classify all the pages that are possibly relevant to a topic, but only the most authoritative information on the requested subject. We describe recent experiments comparing topic resource lists compiled by our CLEVER system to Yahoo! and Altavista . Our results show that for distillation, CLEVER performs substantially better than Altavista, and surprisingly, typically performs better than the hand-built resources at Yahoo.

Search versus distillation

The scope of the web, and the diverse body of users, means that the same engines must service queries ranging from "What is pin 4 of a 74LS00 TTL NAND gate?" to "What is digital logic?" A term index such as Altavista can be useful for very specific queries of the former kind, especially for users experienced in query construction. Unfortunately, the situation for distillation queries such as the second is worse. As an example, consider the query "fishing." This text query matches more than a million pages on today's web. Top responses from the two largest search engines at the time of this writing tend to be advertising pages, often for companies located on individual lakes or regions. On the other hand, a number of excellent online fishing resources are not ranked highly. An engine for distillation (rather than search) should return (say) thirty great pages about fishing in general, filtering out the large number of irrelevant pages (e.g., pages containing "fishing for compliments"), low-usefulness pages (i.e., pages about "fishing in the south of Medford, North Dakota" which would be appropriate responses to a more specific query), and low-quality pages (i.e., the large number of advertising pages offering the same products for the same prices).

The CLEVER algorithm

CLEVER is an extension of HITS (see [Chakrabarti et al], [Kleinberg]). We begin with a quick overview of HITS, and then describe the extensions that led to CLEVER.


HITS produces two distinct but related types of pages in response to a query topic: hubs and authorities . Hubs and authorities exhibit a mutually reinforcing relationship: a good hub points to many good authorities; a good authority is pointed to by many good hubs (pages can be both good authorities and good hubs.). HITS proceeds as follows.

1. Starting from a user-supplied query, HITS assembles an initial set of pages: typically, up to 200 pages returned by a text search engine such as AltaVista on that query. These pages are then expanded to a larger root set by adding any pages that are linked to or from any page in the initial set.

2. HITS then associates with each page p a hub-weight h(p) and an authority weight a(p), all initialized to 1.

3. HITS then iteratively updates the hub and authority weights of each page in the root set as follows. First, under the intuition that a page pointing to good authorities should be considered a good hub, we replace the hub score of each page by the sum of the authorities of the pages it points to. And second, dually, under the intuition that a page pointed to by good hubs should be considered a good authority, we replace the authority score of each page by the sum of the hub scores of the pages that point to it.

4. The update operations are performed for all the pages, and the process repeated (normalizing the weights after each iteration) for some number of rounds. Following this, the pages with the highest h(p) and a(p) scores are output as the best hubs and authorities.

See [Chakrabarti et al], [Kleinberg] for more details on HITS and other link-based search methods.


CLEVER links pages together using more detailed information than simply whether a page links to another page. This section gives some intuition for the type of information that we incorporate. First, the connection from one page to another depends on a function of the particular query being serviced and the text that appears near the link; a rudimentary version of this modification was previously described in [Chakrabarti et al].

Next, in the course of developing CLEVER, we encountered a number of idiosyncracies of the web that caused us to modify the algorithm. We now describe these situations, and sketch the resultant modifications:

1. We assume that two pages on the same logical website were created by the same organization or individual, and so should not be allowed to confer authority upon one another. To address this, we vary the weights of the links between pages based on the domains of their end-points.

2. We seek a final set of hubs and authorities that provide good access to a wide variety of information. For instance, two pages that are extremely high quality but contain virtually identical information would not both be returned. To this end, after we output a hub, we diminish the scores of pages that are very similar to that hub.

3. We return only a single point-of-entry into a particular internet resource.

4. We often encounter situations in which a good hub page, for instance, appears with a different level of generality than the query for which it would be useful. As an example, consider the query "mango fruit." A high-quality hub page devoted to exotic fruit might have a section of links on papaya, another section on mango, and finally a section of guava. However, if we consider the page to be a universally good hub, the unrelated destination pages about papaya and guava will be considered to be good authorities. To address this, we identify interesting (physically contiguous) sections of web pages and use these sections to determine which other pages might be good hubs or authorities in their entirety.

User Study

We now describe a recent user study.

Performance Evaluation for the Web

Traditionally, retrieval systems are evaluated using the measures of precision and recall on a large pre-evaluated corpus. As there is no such standard benchmark for the web, we cannot take this approach directly. But the situation is worse than this:

1. The web currently contains around 270 million documents BB98. so rating the entire corpus is a daunting task. But we cannot label only a small subset because broad-topic queries routinely return a million page hits scattered around the web.
2. Even if it were possible to create such a corpus, a million new pages arrive daily, so the corpus could be used to evaluate actual search engines for only a brief window (probably shorter than the time to gather the relevance judgements) before too many new pages arrived.
3. The composition of a "typical" web document in terms of inlinks, outlinks, average size, graphical content, layout, function, etc., is dynamic. Even if we are not interested in comparisons that include actual search engines, and instead wish to label a snapshot of the web then compare algorithms on this snapshot, results for the web of a few years ago may not generalize to the web of today.
4. Existing labeled corpora such as TREC are primarily standalone text while we examine algorithms that rely fundamentally on the hyperlinked nature of the web. Even other hyperlinked corpora tend to contain documents of much higher quality than the web. So modifying existing labeled corpora for evaluating web-targeted algorithms is also difficult.
5. The closest approximations to "relevance judgements" on today's web are sites such as Yahoo!, which through human involvement collect high-quality pages on a number of topics. Unfortunately, due to the growth figures given above these sites cannot hope to index all pages on the web. If a search engine returns a page on a particular topic that is also indexed by Yahoo! under that topic, we have a strong indication that the page is high quality. If, however (as is more likely), Yahoo! does not index the page, we have no information about the quality of the page.


Because of these difficulties, we choose to evaluate our approach using relevance judgements (gathered through a user study) on a set of benchmark topics introduced in [Chakrabarti et al] and subsequently used in an independent study [BH98]. We therefore compare ourselves against the best-known automatic search engine, Altavista, and the best-known human-compiled resource site, Yahoo! We compute the precision of all three sources on a fixed number of pages according to our user-provided relevance judgements and compare these results. We refer to this technique as comparative precision.

More precisely, for each of our 27 broad-topic queries we extracted ten pages from each of our three sources. Altavista and Clever were both run on the query in the usual manner. The same search was entered manually into Yahoo!'s search engine, and of the resulting leaf nodes, the one best matching the query was picked by hand. If the best match contained too few links, the process was repeated to generate additional links. Using this procedure we took the top ten pages from Altavista, the top five hubs and five authorities returned by CLEVER, and a random ten pages from the most relevant node or nodes of Yahoo! (Yahoo! lists pages alphabetically and performs no ranking, hence the unbiased choice of ten random pages.) We then merged these three sets and sorted the resulting approximately 30 pages alphabetically (there are almost never duplicate pages from the three sources). We asked each user to rank each of these pages "bad," "fair," "good," or "fantastic" based on how useful the page would be in learning about the topic. We took good and fantastic pages to be relevant, noted the source (AltaVista/CLEVER/Yahoo) of each page, and then computed precision in the traditional manner. Since our users evaluated only the pages returned from our three sources, but did not know which source returned which page, we refer to this type of data as blind post-facto relevance judgements.

The subjective evaluation of relevance was performed by a set of 37 subjects, yielding 1369 datapoints. The subjects were required to have access to the web and familiarity with a common web browser, but were not experts in computer science, or on the topics. Each subject was free to browse the list of pages at leisure, visiting each page as many times as desired, before deciding on a final quality score.


Figure 1 summarizes the fractions of topics on which each of the three sources was adjudged the best in terms of comparative precision. In 31% of the cases, the difference between Yahoo! and Clever (tied for the best) was statistically insignificant; on fully 50% of the topics, Clever was adjudged the best. Figure 2 shows the average comparitive precision of each search engine over the set of 27 queries.

Figure 1: Summary of Results

Figure 2: Average precision of each engine over all queries.

Ongoing Work

A natural direction to extend our topic distillation system is the automatic compilation of hierarchical resource lists such as Yahoo! It is important to emphasize that we do not seek to automatically classify pages on the web into a taxonomy (a task undertaken by Yahoo! ontologists). Rather, we wish to use CLEVER to discover, for each node of the taxonomy, the best resources that the web has to offer for that topic. The benefits of such an ontology would be: (1) it can be compiled with minimal human effort; (2) it seeks the best current resources, rather than to classify submitted URLs; (3) being automatic, it can be run periodically (say daily) to maintain an up-to-date view of the topic. We are currently examining a number of issues and challenges in this regard: how can CLEVER exploit the structure of the topic hierarchy? can a partially-built taxonomy be boot-strapped into a complete taxonomy with some human intervention? what is the experience of ontologists administering such an automatic taxonomy builder?


K. Bharat and A. Broder. A technique for measuring the relative size and overlap of public Web search engines. Proceedings of the 7th World-Wide Web conference, 1998.
K. Bharat and M. Henzinger. Improved Algorithms for Topic Distillation in Hyperlinked Environments. Proceedings ACM SIGIR 98.
S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. Proceedings of the 7th World-Wide Web conference, 1998.
J. Kleinberg. Authoritative sources in a hyperlinked environment. Proc. ACM-SIAM Symposium on Discrete Algorithms (1998). Also available as IBM Research Report RJ 10076(91892) May 1997 (http://www.watson.ibm.com:8080/main-cgi-bin/search_paper.pl/entry_ids=8690 ) and as http://www.cs.cornell.edu/home/kleinber/auth.ps.