|
Our recent work
on the Web focuses on the following topics: web structure, data mining
on the web, evolutionary aspects on the web, blogs, ranking and rank
aggregation, and
incremental crawling.
Web structure
Geospatial mapping.
Web
pages may be organized, indexed, searched, and navigated along several
different feature dimensions. Kevin McCurley investigates
different approaches
to discovering geographic context for web pages, and describes
a navigational
tool for browsing web resources by geographic proximity.
Self-similarity.
Stephen Dill, Ravi Kumar,
Kevin
McCurley, Sridhar Rajagopalan, D. Sivakumar, and Andrew Tomkins present
an
extensive characterization of the graph structure of the web, with a
view
to enabling high-performance applications that make use of this
structure.
In particular, they argue that the web emerges as the outcome of a
number of
essentially independent stochastic processes that evolve at various
scales.
A striking consequence of this scale invariance is that the structure
of the
web is a fractal---cohesive
sub-regions display the same characteristics as the
web
at large.
Locality, hierarchy, and
bidirectionality. The world wide web has been
previously observed to be a small world network in which nodes
are clustered together.
Nadav Eiron and
Kevin McCurley provide evidence, based on a crawl of over a billion
pages, that such a clustering effect corresponds very closely to the
hierarchical nature of the URLs. They propose a new paradigm for
models of the web that incorporates the hierarchical evolution and
structure that is evident in the web.
-
Nadav Eiron, Kevin McCurley.
Locality,
hierarchy, and birectionality in the web. Second Workshop on Algorithms and Models
for the Web-graph 2003
Text analysis
Semantic
annotation. Stephen Dill,
Nadav Eiron,
David Gibson, Dan
Gruhl, R. Guha, Anant Jhingran, Tapas Kanungo,
Sridhar Rajagopalan, Andrew Tomkins,
John A. Tomlin, and Jason Zien describe Seeker, a platform for large-scale
text
analytics, and SemTag, an
application written on the platform to
perform automated semantic tagging of large corpora. They apply
SemTag to a collection of approximately 264 million web pages, and
generate approximately 434 million automatically disambiguated
semantic tags, published to the web as a label bureau providing
metadata regarding the 434 million annotations. They argue that
automated large scale semantic tagging of ambiguous
content can bootstrap and accelerate the creation of the semantic
web.
-
Stephen Dill,
Nadav Eiron, David Gibson,
Daniel Gruhl, R. Guha, Anant Jhingran,
Tapas Kanungo, Sridhar Rajagopalan, Andrew Tomkins,
John A. Tomlin, Jason Y. Zien. SemTag
and Seeker: Bootstrapping the Semantic
Web via Automated Semantic Annotation. WWW 2003: 178-186
Compound
documents. Most text analysis is designed to deal with
the concept of a "document," namely, a cohesive presentation of thought
on a unifying subject.
Nadav Eiron and
Kevin McCurley claim that the notions of "document" and "web page" are
not synonymous, and that authors often tend to deploy documents as
collection of URLs, which they call compound
documents. They present new techniques for identifying and
working with such compound documents.
Mining
Template detection.
Ziv Bar-Yossef and
Sridhar Rajagopalan formulate and propose the template detection
problem, and suggest a practical solution for it based on counting
frequent
item
sets. They describe three principles, which characterize the
assumptions
made by hypertext information retrieval and data mining systems, and
show
that templates are a major source of violation of these principles. As
a consequence,
basic ``pure'' implementations of simple search algorithms coupled with
template
detection and elimination show surprising increases in precision at all
levels
of recall.
Anti-aliasing. Jasmine
Novak, Prabhakar Raghavan, and Andrew Tomkins show that for postings on online bulletin boards,
network-level
security alone is inadequate to guarantee privacy. They consider the
problem of reverse engineering the multiple aliases belonging to an
individual based on text posted by those aliases in public fora such as
bulletin boards, netnews, weblogs, or web pages. Their results
show that a careful combination of similarity detection and a tailored
clustering algorithm can be disturbingly effective at teasing out
aliases from large collections of posters.
Trust propagation.
A network of people connected by directed ratings or trust scores, and
a model for propagating those trust scores, is a fundamental building
block in many of today's most successful e-commerce and recommendation
systems. R. Guha, Ravi
Kumar, Prabhakar Raghavan,
and Andrew Tomkins develop a framework of trust propagation schemes,
each of which may be appropriate in certain circumstances, and
evaluate the schemes on a large trust network consisting of 800K trust
scores expressed among 130K people. The
work appears to be the first to incorporate distrust in a computational
trust propagation setting.
Evolution
Link evolution and
incremental PageRank. Steve
Chien,Cynthia
Dwork, Ravi Kumar, D. Sivakumar, and Dan Simon
present a very
efficient algorithm to incrementally compute good approximations to
PageRank, as links evolve. This
algorithm is both fast and yields excellent approximations to PageRank,
even in light of large amounts of changes to the link structure. The
algorithm derives intuition and partial justification from a rigorous monotonicity
analysis of Markov chains, where they prove two
monotonicity theorems---adding a link always causes the PageRank (both
the value and the ordinal position) of the target to only increase.
-
Steve Chien, Cynthia Dwork, Ravi Kumar, D.
Sivakumar, Dan Simon.
Link
evolution: Analysis and algorithms. First Workshop on Algorithms and Models
for the Web-graph 2002
Bursty evolution of
Blogspace. Ravi Kumar, Jasmine Novak, Prabhakar Raghavan, and
Andrew Tomkins propose two new tools to address the evolution of
hyperlinked corpora (like Blogspace,
the space of weblogs). First, they define time
graphs to extend the traditional notion of an evolving directed
graph, thereby capturing link creation as a point phenomenon in time;
second, they develop definitions and algorithms for time-dense
community tracking, to crystallize the notion
of community evolution. They extend recent
work of Kleinberg to discover dense periods of ``bursty''
intra-community link creation.
Decay.
Page creators are faced with an increasingly
burdensome task of keeping links up-to-date, and many are falling
behind. In addition to just individual pages, collections of pages or
even entire neighborhoods of the web exhibit significant decay,
rendering them less effective as information resources. Ziv Bar-Yossef,
Andrei Broder, Ravi Kumar, and Andrew Tomkins present a recursive
algorithm for measuring decay. This
algorithm can be easily computed with only limited resources, thereby
making it
practical for web page maintainers, ontologists, and individuals.
Information propagation. Dan Gruhl, R.
Guha, David Liben-Nowell,
and Andrew Tomkins study the dynamics of information
propagation in environments of low-overhead personal publishing,
using
a large collection of Weblogs over time as an example domain.
They characterize and model this collection at both macroscopic and
microscopic levels. Based only on information about
which topics are covered by which authors at which times, they
present, validate, and employ an algorithm to induce the
underlying propagation network, which learns three fundamental
properties: the stickiness of each individual topic, the
likelihood that one author will adopt a topic from another, and the
frequency with which one author reads the content of another.
Search
and ranking
Rank aggregation.
Combining ranking results from
various sources is a basic problem in several Internet applications:
building meta-search engines, combining ranking functions, selecting
documents from a database based on multiple criteria, and improving
search precision through word associations.
Cynthia Dwork,
Moni
Naor, Ravi
Kumar, and D. Sivakumar have developed a set of algorithms
for the rank aggregation problem . These algorithms have two
novel
features: they are designed to effectively combat spam, a serious
problem
in Web searches; and they provide a new way to perform multi-word or
multi-criteria
searches.
Generalized HITS.
The Generalized HITS
algorithm (GHITS), developed by David Gibson, can perform
local HITS calculations in real time, based on a precomputed global
singular value decomposition. GHITS calculates page reputation
based on a moderation
systems philosophy: links express approval, but the approval is
modulated by the status of the link creator. Moderators, of course, may
link to many topics, and central to the algorithm is a parameter to
prevent topic drift, which can occur when highly-rated moderators boost
the reputation of off-topic pages.
Intranet search.
The social
forces that guide the development of intranets
are quite different from those of the internet, and
the determination of a "good answer'" for intranet search is quite
different than on the Internet. Ron Fagin, Ravi Kumar, Kevin
McCurley, Jasmine Novak, D. Sivakumar, John Tomlin, and David Williamson propose
a novel approach to the problem of
intranet search, based on the use of rank aggregation. The approach
allows the search-engine builder to examine the effects of different
heuristics on ranking of web search results; this leads to an
architecture that can be easily customized for deployment in a variety
of scenarios.
-
Ronald Fagin, Ravi Kumar, Kevin McCurley,
Jasmine Novak, D. Sivakumar, John A. Tomlin, David Williamson. Searching the
workplace web. WWW 2003:
366-375
Anchor
text. It has been observed that anchor text in web
documents is
very useful in improving the quality of web text search for some
classes of queries. By examining properties of anchor text in a
large intranet, Nadav
Eiron and Kevin McCurley shed light on why this is the case.
They conduct experiments to investigate several aspects of anchor text,
including their relationship to titles, the frequency of queries that
can be satisfied by anchor text alone, and the homogeneity of results
fetched by anchor text.
Storylines.
Ravi Kumar and D. Sivakumar, in joint work with Uma Mahadevan, propose
a framework to tap the wealth of information hidden beyond the first
page of search results. Their framework is axiomatically
developed and combinatorial in nature. They develop simple and
effective algorithms that discover storylines---windows
that offer glimpses into interesting themes latent in the search
results beyond the top 10.
Crawling
Incremental
crawler. Jenny
Edwards, Kevin McCurley, and John Tomlin outline the construction
of an incremental crawler and
describe an optimization model for controlling
the crawl strategy. The model makes no assumptions about the
statistical behavior of web page changes, but rather uses an adaptive
approach to maintain data on actual change rates which are in turn used
as inputs for the optimization. Computational results show that there
is no single objective---different, but equally plausible, objectives
lead to
conflicting optimal strategies----but, there
are compromise objectives which lead to good strategies that are robust
against
a number of criteria.
|