IBM Research

Web-related Research

Computer Science


Our recent work on the Web focuses on the following topics: web structure, data mining on the web, evolutionary aspects of the web, blogs, ranking and rank aggregation, and incremental crawling.

Web structure

Geospatial mapping.  Web pages may be organized, indexed, searched, and navigated along several different feature dimensions.  Kevin McCurley investigates different approaches to discovering geographic context for web pages, and describes a navigational tool for browsing web resources by geographic proximity.

Self-similarity.   Stephen Dill, Ravi Kumar, Kevin McCurley, Sridhar Rajagopalan, D. Sivakumar, and Andrew Tomkins present an extensive characterization of the graph structure of the web, with a view to enabling high-performance applications that make use of this structure.  In particular, they argue that the web emerges as the outcome of a number of essentially independent stochastic processes that evolve at various scales.  A striking consequence of this scale invariance is that the structure of the web is a fractal---cohesive sub-regions display the same characteristics as the web at large.

Locality, hierarchy, and bidirectionality.  The world wide web has been previously  observed to be a small world network in which nodes are clustered together.  Nadav Eiron and Kevin McCurley provide evidence, based on a crawl of over a billion pages, that such a clustering effect corresponds very closely to the hierarchical nature of the URLs.  They propose a new paradigm for models of the web that incorporates the hierarchical evolution and structure that is evident in the web.

  • Nadav Eiron, Kevin McCurley.  Locality, hierarchy, and bidirectionality in the web.  Second Workshop on Algorithms and Models for the Web-graph, 2003
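The URL-hierarchy observation can be made concrete with a small sketch: treating the host and path segments of a URL as nodes of a tree, the proximity of two pages is their distance in that tree. The function below is illustrative only, not the authors' method:

```python
from urllib.parse import urlparse

def url_tree_distance(u1, u2):
    """Distance between two URLs in the implied hierarchy, where the
    host and each path segment form successive levels of a tree."""
    def parts(u):
        p = urlparse(u)
        return [p.netloc] + [s for s in p.path.split("/") if s]
    a, b = parts(u1), parts(u2)
    # Length of the common prefix (the shared ancestor path).
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    # Steps up from each URL to the common ancestor.
    return (len(a) - k) + (len(b) - k)
```

Under the paper's observation, pairs of pages that link to each other would tend to have small tree distance.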

Text analysis

Semantic  annotation.  Stephen Dill, Nadav Eiron, David Gibson, Dan Gruhl, R. Guha, Anant Jhingran, Tapas Kanungo, Sridhar Rajagopalan, Andrew Tomkins, John A. Tomlin, and Jason Zien describe Seeker, a platform for large-scale text analytics, and SemTag, an application written on the platform to perform automated semantic tagging of large corpora. They apply SemTag to a collection of approximately 264 million web pages, and generate approximately 434 million automatically disambiguated semantic tags, published to the web as a label bureau providing metadata regarding the 434 million annotations.  They argue that automated large scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web.

Compound documents.  Most text analysis is designed to deal with the concept of a "document," namely, a cohesive presentation of thought on a unifying subject.  Nadav Eiron and Kevin McCurley claim that the notions of "document" and "web page" are not synonymous, and that authors often deploy a single logical document as a collection of URLs, which they call a compound document.  They present new techniques for identifying and working with such compound documents.


Template detection.  Ziv Bar-Yossef and Sridhar Rajagopalan formulate the template detection problem and propose a practical solution for it based on counting frequent item sets.  They describe three principles that characterize the assumptions made by hypertext information retrieval and data mining systems, and show that templates are a major source of violations of these principles.  As a consequence, even basic "pure" implementations of simple search algorithms, when coupled with template detection and elimination, show surprising increases in precision at all levels of recall.
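The frequent-item-set idea behind template detection can be sketched in a few lines. Here each page is assumed to be pre-segmented into text blocks, and only single-block frequencies are counted, a deliberate simplification of the paper's approach:

```python
from collections import Counter

def detect_templates(pages, support=0.5):
    """Flag blocks appearing on at least a `support` fraction of pages
    as template content (a one-item special case of frequent item sets)."""
    counts = Counter(block for page in pages for block in set(page))
    threshold = support * len(pages)
    return {b for b, c in counts.items() if c >= threshold}

def strip_templates(page, templates):
    """Remove detected template blocks before indexing or mining."""
    return [b for b in page if b not in templates]
```

A navigation bar repeated across a site would be flagged and stripped, while page-specific content survives.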

Anti-aliasing.  Jasmine Novak, Prabhakar Raghavan, and Andrew Tomkins show that for postings on online bulletin boards, network-level security alone is inadequate to guarantee privacy. They consider the problem of reverse engineering the multiple aliases belonging to an individual based on text posted by those aliases in public fora such as bulletin boards, netnews, weblogs, or web pages. Their results show that a careful combination of similarity detection and a tailored clustering algorithm can be disturbingly effective at teasing out aliases from large collections of posters.
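A minimal sketch of the similarity-detection half of such a pipeline, assuming each alias's posts are reduced to a bag of words (the paper's actual features and clustering algorithm are more elaborate):

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity of two word-count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def link_aliases(posts, threshold=0.5):
    """posts: dict alias -> concatenated text.  Return alias pairs
    whose word-usage vectors are similar enough to suggest one author."""
    vecs = {a: Counter(t.lower().split()) for a, t in posts.items()}
    names = sorted(vecs)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if cosine(vecs[a], vecs[b]) >= threshold]
```

The threshold and the bag-of-words representation are assumptions for illustration; richer stylistic features make such matching considerably more effective, which is the paper's privacy concern.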

Trust propagation.  A network of people connected by directed ratings or trust scores, and a model for propagating those trust scores, is a fundamental building block in many of today's most successful e-commerce and recommendation systems.  R. Guha, Ravi Kumar, Prabhakar Raghavan, and Andrew Tomkins develop a framework of trust propagation schemes, each of which may be appropriate in certain circumstances, and evaluate the schemes on a large trust network consisting of 800K trust scores expressed among 130K people.  The work appears to be the first to incorporate distrust in a computational trust propagation setting.
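One simple propagation scheme in this spirit can be sketched as damped powers of the direct trust matrix, with negative entries modeling distrust. The specific scheme, damping, and step count below are illustrative, not the paper's exact formulation:

```python
def propagate_trust(direct, steps=2, decay=0.5):
    """direct: n x n matrix of directly expressed trust scores
    (negative values express distrust).  Propagated trust is the sum
    of matrix powers, damped by `decay` per additional hop."""
    n = len(direct)
    total = [row[:] for row in direct]
    power = [row[:] for row in direct]
    w = 1.0
    for _ in range(steps - 1):
        # Next matrix power: trust reaching one hop further.
        power = [[sum(power[i][k] * direct[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
        w *= decay
        total = [[total[i][j] + w * power[i][j] for j in range(n)]
                 for i in range(n)]
    return total
```

For a chain in which person 0 trusts 1 and 1 trusts 2, this assigns person 0 a damped indirect trust in person 2.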


Web evolution

Link evolution and incremental PageRank.  Steve Chien, Cynthia Dwork, Ravi Kumar, D. Sivakumar, and Dan Simon present a very efficient algorithm to incrementally compute good approximations to PageRank as links evolve.  The algorithm is fast and yields excellent approximations to PageRank, even in the face of large changes to the link structure.  It derives its intuition and partial justification from a rigorous monotonicity analysis of Markov chains, in which they prove two monotonicity theorems: adding a link can only increase the PageRank of the target, both in value and in ordinal position.

  • Steve Chien, Cynthia Dwork, Ravi Kumar, D. Sivakumar, Dan Simon.  Link evolution: Analysis and algorithms.  First Workshop on Algorithms and Models for the Web-graph, 2002
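For reference, the quantity being approximated is ordinary PageRank, computable by power iteration. The sketch below is the standard batch algorithm, not the incremental one, and it also exhibits the monotonicity property numerically:

```python
def pagerank(links, d=0.85, iters=50):
    """Standard power-iteration PageRank.
    links: dict mapping each node to its list of out-neighbors."""
    nodes = sorted(set(links) | {v for vs in links.values() for v in vs})
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}
        for u in nodes:
            out = links.get(u, [])
            if out:
                share = d * pr[u] / len(out)
                for v in out:
                    new[v] += share
            else:  # dangling node: spread its mass uniformly
                for v in nodes:
                    new[v] += d * pr[u] / n
        pr = new
    return pr
```

Adding a link from c to b in a small graph increases b's PageRank, in line with the theorem quoted above.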

Bursty evolution of Blogspace. Ravi Kumar, Jasmine Novak, Prabhakar Raghavan, and Andrew Tomkins propose two new tools to address the evolution of hyperlinked corpora (like Blogspace, the space of weblogs). First, they define time graphs to extend the traditional notion of an evolving directed graph, thereby capturing link creation as a point phenomenon in time; second, they develop definitions and algorithms for time-dense community tracking, to crystallize the notion of community evolution.  They extend recent work of Kleinberg to discover dense periods of "bursty" intra-community link creation.
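The notion of a burst of link creation can be illustrated with a much simpler threshold test than Kleinberg's state-automaton model. The simplification assumed here: a burst is a maximal run of time steps whose count sits well above the overall mean.

```python
def bursty_intervals(counts, factor=2.0):
    """Return (start, end) index pairs of maximal runs of time steps
    whose link-creation count exceeds `factor` times the mean.
    A threshold stand-in for Kleinberg-style burst detection."""
    mean = sum(counts) / len(counts)
    bursts, start = [], None
    for t, c in enumerate(counts):
        if c > factor * mean:
            if start is None:
                start = t
        elif start is not None:
            bursts.append((start, t - 1))
            start = None
    if start is not None:
        bursts.append((start, len(counts) - 1))
    return bursts
```

Kleinberg's automaton additionally smooths over brief dips and assigns a cost-based intensity to each burst, which this toy version does not.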

Decay.  Page creators face an increasingly burdensome task of keeping links up to date, and many are falling behind. Beyond individual pages, collections of pages or even entire neighborhoods of the web exhibit significant decay, rendering them less effective as information resources. Ziv Bar-Yossef, Andrei Broder, Ravi Kumar, and Andrew Tomkins present a recursive algorithm for measuring decay. The measure can be computed easily with limited resources, making it practical for web page maintainers, ontologists, and individuals.
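A hedged sketch of one recursive decay measure: a page is fully decayed if it is dead, and otherwise inherits a damped average of its out-links' decay. The `fetch` helper, the damping constant, and the depth cutoff are assumptions for illustration, not the paper's exact definition:

```python
def decay_score(url, fetch, depth=2):
    """Recursive decay estimate in [0, 1].
    fetch(url) is an assumed helper returning the page's list of
    out-links, or None if the page is dead (e.g. returns 404)."""
    links = fetch(url)
    if links is None:
        return 1.0          # dead page: fully decayed
    if depth == 0 or not links:
        return 0.0          # live page with nothing further to check
    return 0.5 * (sum(decay_score(v, fetch, depth - 1) for v in links)
                  / len(links))
```

A live page half of whose out-links are dead thus scores 0.25, reflecting a decaying neighborhood rather than a dead page.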

Information propagation.  Dan Gruhl, R. Guha, David Liben-Nowell, and Andrew Tomkins study the dynamics of information propagation in environments of low-overhead personal publishing, using a large collection of Weblogs over time as an example domain.  They characterize and model this collection at both macroscopic and microscopic levels.  Based only on information about which topics are covered by which authors at which times, they present, validate, and employ an algorithm to induce the underlying propagation network, which learns three fundamental properties:  the stickiness of each individual topic, the likelihood that one author will adopt a topic from another, and the frequency with which one author reads the content of another.

Search and ranking

Rank aggregation.  Combining ranking results from various sources is a basic problem in several Internet applications: building meta-search engines, combining ranking functions, selecting documents from a database based on multiple criteria, and improving search precision through word associations. Cynthia Dwork, Moni Naor, Ravi Kumar, and D. Sivakumar have developed a set of algorithms for the rank aggregation problem. These algorithms have two novel features: they are designed to effectively combat spam, a serious problem in web search; and they provide a new way to perform multi-word or multi-criteria searches.
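As an illustration of the problem itself (not of the authors' spam-resistant algorithms, which are Markov-chain and footrule-based), the classical Borda count aggregates rankings by positional scoring:

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Aggregate several ranked lists by Borda count: in a list of n
    items, the item at position i earns n - i points.  Ties break
    alphabetically for determinism."""
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            scores[item] += n - pos
    return sorted(scores, key=lambda x: (-scores[x], x))
```

Borda is easy to manipulate by a single spam-heavy input list, which is precisely the weakness the paper's methods are designed to resist.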

Generalized HITS.  The Generalized HITS algorithm (GHITS), developed by David Gibson, can perform local HITS calculations in real time, based on a precomputed global singular value decomposition.  GHITS calculates page reputation based on a moderation-system philosophy: links express approval, but the approval is modulated by the status of the link creator. Moderators, of course, may link to many topics, and central to the algorithm is a parameter to prevent topic drift, which can occur when highly rated moderators boost the reputation of off-topic pages.
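GHITS builds on the classic HITS hub/authority iteration, which the sketch below implements in its basic batch form; the precomputed SVD that makes GHITS fast and localized is not shown:

```python
import math

def hits(links, iters=30):
    """Classic HITS: authorities accumulate hub weight from in-links,
    hubs accumulate authority weight from out-links, normalizing each
    round.  links: dict node -> list of out-neighbors."""
    nodes = sorted(set(links) | {v for vs in links.values() for v in vs})
    hub = {u: 1.0 for u in nodes}
    inlinks = {u: [] for u in nodes}
    for u, vs in links.items():
        for v in vs:
            inlinks[v].append(u)
    for _ in range(iters):
        auth = {v: sum(hub[u] for u in inlinks[v]) for v in nodes}
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {v: a / norm for v, a in auth.items()}
        hub = {u: sum(auth[v] for v in links.get(u, [])) for u in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {u: h / norm for u, h in hub.items()}
    return hub, auth
```

In GHITS terms, a hub score plays the role of a moderator's status: its endorsement of a page counts for more.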

Intranet search.  The social forces that guide the development of intranets are quite different from those of the Internet, and the determination of a "good answer" for intranet search differs accordingly.  Ron Fagin, Ravi Kumar, Kevin McCurley, Jasmine Novak, D. Sivakumar, John Tomlin, and David Williamson propose a novel approach to intranet search based on rank aggregation. The approach allows the search-engine builder to examine the effects of different heuristics on the ranking of web search results; this leads to an architecture that can be easily customized for deployment in a variety of scenarios.

  • Ronald Fagin, Ravi Kumar, Kevin McCurley, Jasmine Novak, D. Sivakumar, John A. Tomlin, David Williamson.  Searching the workplace web.  WWW 2003: 366-375

Anchor text.  It has been observed that anchor text in web documents is very useful in improving the quality of web text search for some classes of queries.  By examining properties of anchor text in a large intranet, Nadav Eiron and Kevin McCurley shed light on why this is the case.  They conduct experiments to investigate several aspects of anchor text, including their relationship to titles, the frequency of queries that can be satisfied by anchor text alone, and the homogeneity of results fetched by anchor text.

Storylines.  Ravi Kumar and D. Sivakumar, in joint work with Uma Mahadevan, propose a framework to tap the wealth of information hidden beyond the first page of search results.  Their framework is axiomatically developed and combinatorial in nature.  They develop simple and effective algorithms that discover storylines---windows that offer glimpses into interesting themes latent in the search results beyond the top 10.


Incremental crawler.  Jenny Edwards, Kevin McCurley, and John Tomlin outline the construction of an incremental crawler and describe an optimization model for controlling the crawl strategy.  The model makes no assumptions about the statistical behavior of web page changes; instead it uses an adaptive approach, maintaining data on actual change rates that in turn serve as inputs to the optimization. Computational results show that there is no single best objective: different, but equally plausible, objectives lead to conflicting optimal strategies. However, there are compromise objectives that lead to good strategies robust against a number of criteria.
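The adaptive change-rate idea can be sketched as an exponential moving average per page, with revisit priority given to pages estimated to change most often. The names and constants here are illustrative, and the greedy ordering stands in for the paper's optimization model:

```python
def update_change_rate(rate, changed, alpha=0.3):
    """Blend the previous change-rate estimate with the latest
    observation (1.0 if the page changed since the last visit,
    0.0 otherwise), weighting the new observation by alpha."""
    return (1 - alpha) * rate + alpha * (1.0 if changed else 0.0)

def crawl_order(rates):
    """Greedy stand-in for the optimizer: revisit pages in decreasing
    order of estimated change rate."""
    return sorted(rates, key=lambda u: -rates[u])
```

The estimates feed the optimization as data, so no prior statistical model of page-change behavior is required, matching the paper's design goal.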
