IBM Research

Information Discovery for the Enterprise

Computer Science


 Overview

Modern business enterprises need to integrate thousands of highly heterogeneous information sources containing structured, semi-structured, or unstructured data. These sources include web sites, syndicated data feeds, content-management warehouses and marts, email archives, spreadsheets and other office documents, and both standardized and custom application programs. The traditional data-warehouse approach of information integration through "manual" schema design and ETL is no longer feasible, because this methodology does not scale and relies on expensive human resources.

The goal of the project is to develop a next-generation integrated database management system that advances traditional information integration, content management, and data warehouse techniques. The system is being designed to automatically discover, unify, and aggregate data from a large number of disparate sources and to construct a global business view in terms of "Universal Business Objects". Our approach incorporates novel methods for automated "data cartography" that rest on synopsis-construction techniques such as sampling and hashing, statistical techniques for similarity analysis, pattern matching via automata, graph-analytic techniques for complex object identification, and algorithms for automatic discovery of business-object metadata. The system will provide a powerful querying interface that supports a wide variety of information-retrieval methods, including ad hoc querying via intuitive keyword-based search, graph querying based on similarities, automatic report generation, graphical and traditional OLAP, spreadsheets, data mining algorithms, and real-time dashboards. Enabling technologies for our solution include XML and XQuery, web services, caching, messaging, and portals for real-time dashboarding and reporting.
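To make the idea of a hash-based synopsis used for similarity analysis concrete, the following is a minimal sketch, not the project's actual method. It assumes a MinHash-style signature as the synopsis and Jaccard similarity as the measure; the function names (minhash_signature, estimated_jaccard) and the choice of 64 hash functions are hypothetical and chosen only for illustration.

```python
import hashlib


def _hash(value: str, seed: int) -> int:
    """Deterministic 64-bit hash of a value under a given seed (illustrative choice)."""
    digest = hashlib.blake2b(
        value.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(values, num_hashes=64):
    """Build a fixed-size synopsis of a column: one minimum hash per seed."""
    distinct = {str(v) for v in values}
    if not distinct:
        raise ValueError("cannot build a signature for an empty column")
    return [min(_hash(v, seed) for v in distinct) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    # Two hypothetical columns from different sources that share many values.
    orders_customer_id = [f"C{i}" for i in range(0, 1000)]
    crm_customer_id = [f"C{i}" for i in range(500, 1500)]

    sig_orders = minhash_signature(orders_customer_id)
    sig_crm = minhash_signature(crm_customer_id)
    print("estimated similarity:", estimated_jaccard(sig_orders, sig_crm))
```

The point of such a synopsis is that two columns can be compared through their small, fixed-size signatures rather than through the full data, which is what would let a comparison scale to thousands of sources. A high estimated similarity between columns in different sources is a hint, not proof, that they describe the same kind of entity.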

Important ongoing research issues include the development and empirical evaluation of dataset signatures and similarity measures, scalability and deployment studies, invention of more sophisticated methods for business object discovery, exploitation of ontologies, incorporation of new query processing paradigms, efficient processing of a huge variety of document schemas in an XML repository, and management of evolving schemas with minimal manual intervention. We are extending the system not only to exploit open standards such as XML but also to handle domain-specific standards such as XBRL for financial solutions. In the long run, our system could substantially improve business productivity by providing users with a rich querying environment over information not only within the enterprise but also across the entire supply chain and beyond.

Members

  • Berthold Reinwald (Manager)
  • Andrey Balmin
  • Paul Brown
  • Peter Haas
  • Yannis Sismanis
  • Alkis Simitsis (post-doc visitor)
  • Wensheng Wu (post-doc visitor)

Bibliography

Overview

  • Toward Automated Large-Scale Information Integration and Discovery. Paul Brown, Peter J. Haas, Jussi Myllymaki, Hamid Pirahesh, Berthold Reinwald, Yannis Sismanis; In: Data Management in a Connected World, T. Haerder and W. Lehner, eds., Springer-Verlag, 2005: 161-180

Analysis

  • GORDIAN: Efficient and Scalable Discovery of Composite Keys. Yannis Sismanis, Paul Brown, Peter J. Haas, Berthold Reinwald; VLDB 2006: 691-702
  • A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets. Rainer Gemulla, Wolfgang Lehner, Peter J. Haas; VLDB 2006: 595-606
  • Techniques for Warehousing of Sample Data. Paul G. Brown, Peter J. Haas; ICDE 2006: 6
  • CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies. Ihab Ilyas, Volker Markl, Peter J. Haas, Paul Brown, Ashraf Aboulnaga; SIGMOD 2004: 647-658
  • BHUNT: Automatic Discovery of Fuzzy Algebraic Constraints in Relational Data. Paul Brown, Peter J. Haas; VLDB 2003: 668-679

Querying

  • Document-Centric OLAP in the Schema-Chaos World. Yannis Sismanis, Hamid Pirahesh, Berthold Reinwald; In: Proc. BIRTE Workshop 2006 (co-located with VLDB)


