Information Integration

Project Description

The Information Integration Group has recently started a new project called Midas, with the goal of extracting, cleansing, and integrating data from multiple publicly available data sources. We are initially focused on two domains: 1) a financial domain, where the input dataset consists of a heterogeneous collection of company filings with the US Securities & Exchange Commission (SEC), and 2) a US government domain, with data sources containing information about Congress members, earmarks, and federal spending. In both domains, we are building a scalable Hadoop-based system that transforms the data from a document or record view of the world into an object-centric view, where multiple facts about the same real-world entity are merged into one object with, ideally, clean and complete attributes.
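
To make the move from a record view to an object-centric view concrete, here is a minimal sketch in Python. It is not the Midas implementation: the record fields, the sample company data, the name-normalization key, and the "first non-empty value wins" fusion rule are all assumptions chosen for illustration, and a real system would use much richer entity resolution and would run at scale on Hadoop.

```python
from collections import defaultdict


def normalize_name(name):
    """Hypothetical entity key: lowercase, drop punctuation, collapse whitespace."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def integrate(records):
    """Group records by normalized name and fuse each group into one object."""
    grouped = defaultdict(list)
    for record in records:
        grouped[normalize_name(record["name"])].append(record)

    objects = {}
    for key, group in grouped.items():
        merged = {}
        for record in group:
            for attr, value in record.items():
                # Assumed fusion rule: keep the first non-empty value per attribute.
                if value not in (None, "") and attr not in merged:
                    merged[attr] = value
        objects[key] = merged
    return objects


# Made-up records standing in for facts extracted from separate documents.
records = [
    {"name": "Acme Corp.", "ceo": "J. Doe", "revenue": None},
    {"name": "ACME Corp", "ceo": "", "revenue": 1200},
    {"name": "Globex Inc", "ceo": "H. Scorpio", "revenue": 800},
]
objects = integrate(records)
```

In this sketch the two Acme records, which differ only in capitalization and punctuation, collapse into a single object whose attributes are filled in from both sources, while the Globex record becomes its own object.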

One of the most salient features of our system is that it integrates, in a single framework, multiple components spanning the entire end-to-end integration flow. The main stages in such a flow are:

Our research aims to develop novel algorithms and tools, as well as scalable and reusable software modules, for all the different stages mentioned above. In particular, we are looking at new algorithms for entity resolution that can be integrated with mapping and fusion algorithms and that can be applied on a continuous basis (i.e., as new documents or data sources are discovered). We are also investigating new ways of visualizing, browsing, and understanding the integrated objects and their relationships. Finally, one of our most important goals is to define high-level abstractions and models that can be used to specify the entire integration flow declaratively. In turn, this will make the resulting framework and system applicable to new domains (beyond financial and government) and to new users (i.e., domain experts who are not necessarily data integration experts).

Project Contact: Rajasekar Krishnamurthy


