Project Description
The SystemT project is an amalgam of two major research themes centered
on analytics and search over unstructured content. These two themes
are represented by two corresponding sub-projects:
SystemT-Information Extraction (SystemT-IE) and
SystemT-Programmable Search (SPS).
SystemT-IE: Many enterprises maintain large repositories of unstructured text data, ranging from email and web pages to call-center records and business reports. Unfortunately, this data is of limited use as long as it remains in its unstructured form. Consequently, there has been an increasing interest in enterprise information extraction: building annotators that extract structured information from unstructured enterprise data. Existing information extraction systems have difficulty scaling to enterprise-wide document collections, and building new annotators generally requires specialized expertise and training.
The SystemT-IE project makes information extraction orders of magnitude more scalable and far easier to use. Our information extraction system is built around AQL, a declarative rule language with a familiar SQL-like syntax. AQL replaces the multiple special-purpose languages typically used to build annotators. Because AQL is declarative, rule developers can focus on what to extract and leave it to SystemT-IE's cost-based optimizer to determine the most efficient execution plan for the annotator. SystemT-IE's information extraction engine is currently deployed in many IBM products (Lotus Notes, IBM eDiscovery Analyzer, etc.) and is being used in several ongoing research projects.
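To give a feel for what "focusing on what to extract" means, here is a minimal sketch in Python. It is not AQL and not SystemT-IE's API: the `Span` type, the helper functions, the sample document, and the phone-number pattern are all hypothetical and chosen only for illustration. Each rule states what a match looks like (a dictionary entry, a regular expression, a "follows within N characters" relationship), and the evaluation strategy is left to the helper functions, loosely mirroring the way an optimizer would choose the execution plan.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A matched region of the document (hypothetical type, not from SystemT)."""
    begin: int
    end: int
    text: str


def extract_regex(pattern, doc):
    """Return one Span per regular-expression match in doc."""
    return [Span(m.start(), m.end(), m.group()) for m in re.finditer(pattern, doc)]


def extract_dictionary(entries, doc):
    """Return one Span per case-insensitive whole-word dictionary match."""
    pattern = r"\b(?:" + "|".join(re.escape(e) for e in entries) + r")\b"
    return [Span(m.start(), m.end(), m.group())
            for m in re.finditer(pattern, doc, re.IGNORECASE)]


def follows_within(left, right, max_gap):
    """Pair each left span with every right span starting 0..max_gap chars after it."""
    return [(l, r) for l in left for r in right
            if 0 <= r.begin - l.end <= max_gap]


# Illustrative input; not taken from any real data set.
doc = "Please call Linda at 555-1234 or email Bob."

# The "rules": each line says what to extract, not how to scan the text.
persons = extract_dictionary(["Linda", "Bob"], doc)
phones = extract_regex(r"\b\d{3}-\d{4}\b", doc)
person_phone = follows_within(persons, phones, max_gap=10)

for person, phone in person_phone:
    print(person.text, phone.text)
```

In this toy version the nested loop in `follows_within` is the only "plan". A real engine could, for example, evaluate the rarer of the two inputs first or use the span offsets to avoid comparing every pair, without any change to the rules themselves.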
Our current research focuses on building a complete tooling framework around SystemT-IE that facilitates the development and maintenance of extraction rules for both expert and non-expert users. We are interested in both the algorithmic and the user-experience aspects of the framework. Specific research directions include:
Regular expression learning and contextual clue discovery to facilitate the development of basic building blocks such as regular expressions and dictionaries
Provenance-based rule visualization and automatic refinement to facilitate the understanding and development of complex extraction rules
Mechanisms for exposing complex extraction rules to various types of users to facilitate domain-specific customization and to easily combine existing rules to solve novel extraction problems
SPS: SPS is a platform for developing and integrating high quality search into a wide range of enterprise applications. There are two key ideas underlying SPS:
Concept-based search: The main idea of concept-based search is that user query terms are not merely matched against document text but "interpreted" (i.e., associated with a specific meaning) in the context of a search taxonomy. For example, consider the domain of personal email search with an appropriate taxonomy describing the concepts that people often look for in their emails -- persons, phone numbers, addresses, etc. Given such a taxonomy, SPS will interpret the query "linda address" as "find emails that mention person linda's address", as opposed to merely retrieving emails containing the words "linda" and "address". In general, producing all such possible interpretations and narrowing them down to the ones likely to interest the user is non-trivial. The SPS platform comes with sophisticated, built-in, general-purpose algorithms that perform precisely these tasks.
Rule-driven: Unlike classical approaches to search that depend on complex weights and hard-to-unravel ranking functions to achieve search quality, SPS adopts a fully transparent rule-driven approach. Every step of the search process -- from tokenization and parsing, to index term generation, interpretation, and ranking -- is controlled by rules with well-defined semantics. This ensures that searches configured to work in a certain way continue to reliably produce the expected results even as the underlying data collections and their statistics change. In addition, the rule-driven approach enables application developers and domain experts (e.g., a search administrator) to easily configure and customize SPS components for a specific application or deployment scenario. Most importantly, such customization can often be accomplished without writing a single line of code or tweaking ranking function parameters.
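The two ideas above can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the SPS algorithms: the taxonomy contents (`CONCEPT_NAMES`, `PERSON_NAMES`), the function names, and the single ranking rule are all hypothetical. The sketch enumerates every interpretation of a query against a tiny email-search taxonomy, then orders the interpretations with one explicit rule rather than a weighted scoring function.

```python
from itertools import product

# Hypothetical toy taxonomy for personal email search.
CONCEPT_NAMES = {"address": "Address", "phone": "PhoneNumber"}
PERSON_NAMES = {"linda", "bob"}


def term_readings(term):
    """List every way a single query term can be read under the taxonomy."""
    t = term.lower()
    readings = [("keyword", t)]  # a term can always be a plain keyword
    if t in PERSON_NAMES:
        readings.append(("person", t))
    if t in CONCEPT_NAMES:
        readings.append(("concept", CONCEPT_NAMES[t]))
    return readings


def interpretations(query):
    """Enumerate all combinations of per-term readings for the query."""
    per_term = [term_readings(t) for t in query.split()]
    return [tuple(combo) for combo in product(*per_term)]


def rank(interps):
    """Illustrative rule: prefer interpretations with more taxonomy readings.

    The order depends only on the rule, not on collection statistics, so it
    stays the same as the underlying data changes.
    """
    return sorted(
        interps,
        key=lambda interp: -sum(kind != "keyword" for kind, _ in interp),
    )


ranked = rank(interpretations("linda address"))
best = ranked[0]
print(best)
```

For the query "linda address", the top-ranked interpretation reads "linda" as a person and "address" as the Address concept, while the plain keyword reading of both terms ranks last. Changing the behavior means editing the rule in `rank` or the taxonomy tables, which is the kind of customization the rule-driven design is meant to make straightforward.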
Project Contact: Howard Ho
Publications
Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Frederick Reiss, Shivakumar Vaithyanathan: "Domain Adaptation of Rule-based Annotators for Named-Entity Recognition Tasks". To Appear in EMNLP 2010.
Bin Liu, Laura Chiticariu, Vivian Chu, H.V. Jagadish, Frederick Reiss: "Automatic Rule Refinement for Information Extraction". To Appear in PVLDB 2010.
Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan: "SystemT: An Algebraic Approach to Declarative Information Extraction". ACL 2010.
Ronald Fagin, Benny Kimelfeld, Yunyao Li, Sriram Raghavan, Shivakumar Vaithyanathan: "Understanding Queries in a Search Database System". PODS 2010.
Laura Chiticariu, Yunyao Li, Sriram Raghavan, Frederick Reiss: "Enterprise Information Extraction: Recent Developments and Open Challenges". SIGMOD 2010 (tutorial).
David Simmen, Fred Reiss, Yunyao Li, Suresh Thalamati: "Enabling Enterprise Mashups over Unstructured Text". SIGMOD 2009 (Demonstration)
Eirinaios Michelakis, Rajasekar Krishnamurthy, Peter J. Haas, Shivakumar Vaithyanathan: "Uncertainty management in rule-based information extraction systems". SIGMOD Conference 2009: 101-114
Rajasekar Krishnamurthy, Sriram Raghavan, Huaiyu Zhu: "Evolution of Rule-Based Information Extraction: From Grammars to Algebra". Tutorial given at CIKM 2008.
Yunyao Li, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan, H. V. Jagadish: "Regular Expression Learning for Information Extraction". EMNLP 2008: 21-30
Frederick Reiss, Sriram Raghavan, Rajasekar Krishnamurthy, Huaiyu Zhu, Shivakumar Vaithyanathan: "An Algebraic Approach to Rule-Based Information Extraction". ICDE 2008: 933-942
Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, Huaiyu Zhu: "SystemT: a system for declarative information extraction". SIGMOD Record 37(4):7-13 (2008)
Huaiyu Zhu, Alexander Loeser, Sriram Raghavan, Shivakumar Vaithyanathan: "Navigating the intranet with high precision". WWW 2007.
Shivakumar Vaithyanathan: "Towards Declarative Information Extraction: The Almaden Story". Industrial keynote talk at Web Intelligence 2007.
Eser Kandogan, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan, Huaiyu Zhu: "Avatar semantic search: a database approach to information retrieval". SIGMOD 2006 (demonstration).
AnHai Doan, Raghu Ramakrishnan, Shivakumar Vaithyanathan: "Managing Information Extraction". Tutorial at SIGMOD 2006.
Yunyao Li, Rajasekar Krishnamurthy, Shivakumar Vaithyanathan, H. V. Jagadish: "Getting Work Done on the Web: Supporting Transactional Queries". SIGIR 2006.

