SystemT

Project Description

The SystemT project is an amalgam of two major research themes centered around analytics and search over unstructured content. These two themes are represented by two corresponding sub-projects: SystemT-Information Extraction (SystemT-IE) and SystemT-Programmable Search (SPS).

SystemT-IE: Many enterprises maintain large repositories of unstructured text data, ranging from email and web pages to call-center records and business reports. Unfortunately, this data is of limited use as long as it remains in its unstructured form. Consequently, there has been an increasing interest in enterprise information extraction: building annotators that extract structured information from unstructured enterprise data. Existing information extraction systems have difficulty scaling to enterprise-wide document collections, and building new annotators generally requires specialized expertise and training.

The SystemT-IE project makes information extraction orders of magnitude more scalable and easier to use. Our information extraction system is built around AQL, a declarative rule language with a familiar SQL-like syntax. AQL replaces the multiple obscure languages typically used to build annotators. Because AQL is declarative, rule developers can focus on what to extract and leave it to SystemT-IE's cost-based optimizer to determine the most efficient execution plan for the annotator. SystemT-IE's information extraction engine is currently deployed in many IBM products (Lotus Notes, IBM eDiscovery Analyzer, etc.) and is being used in several ongoing research projects.
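To make the "what, not how" split concrete, here is a minimal sketch in Python. It is not AQL and does not reflect SystemT-IE's actual engine or optimizer. The names `RegexRule`, `Span`, and `run_rules` are hypothetical, chosen only for illustration. Rules are plain data describing what to extract, and a separate function decides how to evaluate them.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A matched region of the document (hypothetical type for this sketch)."""
    begin: int
    end: int
    text: str


@dataclass(frozen=True)
class RegexRule:
    """Declarative rule: states WHAT to extract, nothing about HOW."""
    name: str
    pattern: str


def run_rules(rules, document):
    """Toy stand-in for an engine. Because rules are only data, this
    function is free to pick any evaluation strategy; here it simply
    compiles each pattern and scans the document once per rule."""
    results = {}
    for rule in rules:
        compiled = re.compile(rule.pattern)
        results[rule.name] = [
            Span(m.start(), m.end(), m.group())
            for m in compiled.finditer(document)
        ]
    return results


# Example rules and input (illustrative data, not from the SystemT project).
rules = [
    RegexRule("Phone", r"\d{3}-\d{4}"),
    RegexRule("Name", r"\b(?:Alice|Bob)\b"),
]
doc = "Call Alice at 555-1234 or Bob at 555-9876."
results = run_rules(rules, doc)
```

The rule author only edits the `rules` list. Changing how `run_rules` evaluates them, for example by reordering or batching the scans, would not require touching any rule, which is the property the declarative design relies on.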

Our current research is focused on building a complete tooling framework around SystemT-IE aimed at facilitating the development and maintenance of extraction rules, for both expert and non-expert users. We are interested in the algorithmic aspects, as well as the user experience aspects of the framework. Specific research directions include:

SPS: SPS is a platform for developing and integrating high quality search into a wide range of enterprise applications. There are two key ideas underlying SPS:

Project Contact: Howard Ho

Publications
