|AIM -- Automatic Informative Metadata|
As the world of networked computers continues to explode, searching for textual information begins to take on completely new dimensions. Searching for information is a hard enough problem, but even knowing where to start looking is now tremendously difficult.
A naive solution to this problem is to conceptually join together all of the text from the entire world, put it in one virtual database, and allow searchers to run queries against it. There are several problems with this approach.
AIM has the goal of solving these problems by concisely describing available information sources. These concise descriptions can be compared with queries to produce a limited list of information sources which need to be searched to give the desired result.
What sort of descriptors should be extracted from the information sources? The descriptors we choose are designed to fulfill three requirements: they must be useful for matching the queries searchers pose, they must be automatically extractable, and they must be concise.
The first requirement is obvious but makes the point that there is no reason to extract something from the source just because it is computationally feasible. The extracted descriptors must map well into the ways that searchers describe the information they are looking for.
The second requirement makes maintenance realistic. Manual indexing and characterization cannot be relied upon to maintain a dynamic environment like the World-Wide Web.
The third requirement makes the descriptors useful for information-agent technology. Our goal is to make AIM descriptors a useful language for agents to converse in when trying to fulfill user information needs. For agent technology to work, the agents must be able to describe the information they are looking for, and information brokers must be able to describe the information they know about. For brokers to keep track of a large number of potential sources, the descriptions of those sources must be concise.
AIM descriptors currently fall into three classes: date, writing style, and subject.
AIM characterizes an information source by scanning it for dates. These dates can be represented in several standard styles, such as "January 23, 1994", "23 January 1994", "1/23/94", etc. All of the observed dates are collected and statistics are then reported. These statistics include the mean, the standard deviation, the range of the central 90% of the dates, and a histogram showing the distribution. These data can be used to determine whether a particular information source is relevant for a timely query. The histogram can also be used to determine the "activity" of the source. Does the frequency of dates indicate explosive growth (like comp.infosystems.www.providers) or decaying popularity (like sci.physics.fusion after the cold fusion furor passed)? Are the cutoff dates sharp, indicating that this is an archive of a larger collection? These sorts of questions can be answered by looking at date statistics.
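A minimal sketch of such a date scanner in Python, handling only the three styles named above (a full scanner would recognize more formats; the function names here are illustrative, not AIM's actual interface):

```python
import re
import statistics
from datetime import datetime

MONTHS = ("January February March April May June July "
          "August September October November December").split()
MONTH_RE = "|".join(MONTHS)

# One (pattern, builder) pair per supported style.  The two-digit-year
# style is assumed to mean the 1900s, consistent with the page's era.
PATTERNS = [
    (re.compile(rf"({MONTH_RE}) (\d{{1,2}}), (\d{{4}})"),
     lambda m: datetime(int(m.group(3)), MONTHS.index(m.group(1)) + 1, int(m.group(2)))),
    (re.compile(rf"(\d{{1,2}}) ({MONTH_RE}) (\d{{4}})"),
     lambda m: datetime(int(m.group(3)), MONTHS.index(m.group(2)) + 1, int(m.group(1)))),
    (re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{2})\b"),  # US month/day/year
     lambda m: datetime(1900 + int(m.group(3)), int(m.group(1)), int(m.group(2)))),
]

def date_stats(text):
    """Collect all recognizable dates and summarize them."""
    dates = []
    for pattern, build in PATTERNS:
        for m in pattern.finditer(text):
            try:
                dates.append(build(m))
            except ValueError:
                pass  # matched the pattern but is not a real date, e.g. "13/45/94"
    ordinals = sorted(d.toordinal() for d in dates)
    n = len(ordinals)
    # Crude percentile indices; adequate for the large samples AIM targets.
    lo, hi = ordinals[int(0.05 * n)], ordinals[int(0.95 * n) - 1]
    return {
        "count": n,
        "mean": datetime.fromordinal(int(statistics.mean(ordinals))),
        "stdev_days": statistics.stdev(ordinals),
        "central_90pct": (datetime.fromordinal(lo), datetime.fromordinal(hi)),
    }
```

A histogram over the same ordinals (e.g. bucketed by month) would supply the activity profile described above.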
AIM also extracts simple writing-style information which can help describe a source. One measure is the percentage of words that appear in valid sentences (determined by looking at the structure of all-alphanumeric words and punctuation); this measure differentiates between character graphics, tabular information, lists, and normal grammatical text. The fraction of sentences which are questions or exclamations can also easily be determined. Questions occurring with a frequency of 7-9% indicate a discussion forum, as opposed to declaratory text. This simple measure differentiates between USENET newsgroups and technical papers, which are not otherwise easily separated (by vocabulary, for instance). Finally, AIM measures a grade level of readability for the text by looking at the average number of words per sentence and syllables per word. This measure likewise differentiates between technical papers (11th-14th grade), newswire reports (8th-10th grade), and USENET news (6th-8th grade).
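The page does not name the exact readability formula, so this sketch plugs the same two inputs (words per sentence, syllables per word) into the well-known Flesch-Kincaid grade-level formula; the sentence splitter and syllable heuristic are deliberately crude stand-ins:

```python
import re

def count_syllables(word):
    # Crude heuristic: count vowel groups, with a common silent-e adjustment.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def style_stats(text):
    # Split into sentences at terminal punctuation (a rough approximation).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    questions = sum(1 for s in sentences if s.endswith("?"))
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    # Flesch-Kincaid grade level -- an assumption; AIM's formula is unstated.
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return {
        "question_fraction": questions / len(sentences),
        "grade_level": grade,
    }
```

A question fraction near the 7-9% band mentioned above would then flag a source as a discussion forum.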
These writing style measures are useful in determining the applicability of an information source for many queries. For example, if I am looking for fixes for a bug in Windows, I am only interested in newsgroups and not in newswire services or technical papers. This writing style information can help to narrow searches to produce better results with lower computational expense.
The most important parameter for determining the applicability of an information source is its subject. If the searcher is looking for information about camels that is easy to read, then reports on the space shuttle are not useful no matter how simple they are! AIM determines the subject matter of an information source by assuming that its subject is reflected in a vocabulary-use signature. This signature consists of a list of words and frequencies which indicate how often each word is used when discussing the given subject.
AIM uses some collection of pre-categorized text to determine the subject signatures. For instance, USENET news can be used as a subject categorization system with each newsgroup representing a subject category. By analyzing a USENET archive, the distinctive vocabulary usage of each newsgroup can be determined. An unknown information source can be assigned a list of matching subjects by doing a similarity comparison with the information gathered from the archive. This list of matching subjects then describes the subject of the information source.
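Under the assumption that a signature is a bag of relative word frequencies, the similarity comparison can be sketched with cosine similarity (the names `signature`, `cosine`, and `match_subjects` are illustrative, not AIM's actual interfaces):

```python
import math
import re
from collections import Counter

def signature(text):
    """Relative word frequencies for a body of text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine(sig_a, sig_b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(sig_a[w] * sig_b.get(w, 0.0) for w in sig_a)
    norm_a = math.sqrt(sum(v * v for v in sig_a.values()))
    norm_b = math.sqrt(sum(v * v for v in sig_b.values()))
    return dot / (norm_a * norm_b)

def match_subjects(source_text, category_signatures):
    """Rank pre-categorized subjects by similarity to an unknown source."""
    src = signature(source_text)
    scores = {cat: cosine(src, sig) for cat, sig in category_signatures.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The same `signature` function applied to a handful of sample keywords would implement the keyword-to-subject mapping described below for queries.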
Queries in AIM consist of lists of requirements that information sources should meet. These requirements can include any combination of date, style, and subject descriptions. Subject descriptions can be generated from sample keywords by analyzing the keywords in the same way an information source is analyzed. So a searcher could describe the subject of a source as "fusion, tritium, neutrons" and AIM would show that these words correspond to the formal subject "sci.physics.fusion".
These AIM queries can then be run quickly against large collections of AIM descriptors for the various information sources. The result is a scored list of information sources, where the score indicates the degree of relevance of the source to the AIM query.
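One way such a comparison might combine the three descriptor classes into a relevance score (the descriptor fields and additive weighting here are purely illustrative; the page does not specify AIM's scoring function):

```python
def score_source(descriptor, query):
    """Score one source's AIM descriptor against a query.

    Every query field is optional, and a source earns credit for each
    requirement it satisfies.  All field names here are hypothetical.
    """
    score = 0.0
    if "subject" in query:
        # A subject match contributes the source's similarity to that subject.
        score += descriptor["subject_scores"].get(query["subject"], 0.0)
    if "max_grade" in query and descriptor["grade_level"] <= query["max_grade"]:
        score += 1.0  # readable enough for the searcher
    if "after_year" in query and descriptor["mean_year"] >= query["after_year"]:
        score += 1.0  # recent enough for a timely query
    return score

def rank_sources(descriptors, query):
    """Return (source name, score) pairs, most relevant first."""
    scored = [(d["name"], score_source(d, query)) for d in descriptors]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because descriptors are concise, this scoring pass touches only a few numbers per source rather than its full text, which is what makes it cheap to run over large collections.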
After locating the potential sources, detailed queries can then be run against them to find the desired information. Since the generation of high-quality detailed queries is also difficult, we have developed the FIRE system. FIRE uses boolean relevance feedback to suggest new search terms which will refine queries.
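The page gives no details of FIRE's algorithm; one plausible reading of "relevance feedback to suggest new search terms", sketched here under that assumption, is to rank terms by how much more often they occur in documents the user marked relevant than in those marked irrelevant:

```python
import re
from collections import Counter

def suggest_terms(relevant_docs, irrelevant_docs, k=5):
    """Suggest query-refinement terms from relevance feedback.

    A hypothetical sketch, not FIRE's actual algorithm: score each term
    by its document frequency among relevant documents minus its
    document frequency among irrelevant ones.
    """
    def doc_freqs(docs):
        counts = Counter()
        for doc in docs:
            counts.update(set(re.findall(r"[a-z]+", doc.lower())))
        return counts

    rel, irr = doc_freqs(relevant_docs), doc_freqs(irrelevant_docs)
    n_rel, n_irr = max(len(relevant_docs), 1), max(len(irrelevant_docs), 1)
    scores = {t: rel[t] / n_rel - irr[t] / n_irr for t in rel}
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])][:k]
```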
|IBM Almaden Research Center|
|650 Harry Road|
|San Jose, CA 95120-6099|
Last updated: Thursday, 21-Jul-2011 15:25:04 PDT