IBM®
Skip to main content
    United States [change]    Terms of use
 
 
 
    Home    Products    Services & solutions    Support & downloads    My account    
IBM Research

Hippocratic Database Technology

(none)


Scientific Achievements for Intelligent Information Systems at IBM Almaden Research Center

Scientific Achievements

The goal of our data privacy research is to invent and build next generation information systems that preserve privacy without impeding the flow of information.  Technical achievements include:

Hippocratic Management of Data

Historically, database systems were designed for the most efficient storage, access and querying of information.  We articulated a vision for Hippocratic databases that embrace the responsibility for the privacy of data managed as a fundamental tenet, defined an architecture for this kind of database and identified technical challenges and problems.   We then implemented a vertical slice of the core functionality of the Hippocratic database design. 

We allow an organization to encode its data practices in IBM’s Enterprise Privacy Authorization Language (EPAL) or W3C’s P3P policy specification language.  The policy is first shredded, and the necessary meta information is stored inside the database as relational tables. The next main component matches an organization’s privacy policy with individual user preferences, stated in XPref (an XPath-based language designed by us and published in WWW).  Data is collected from a user only if the organization’s privacy policy conforms to the user’s preferences.  Lastly, a query issued by an application is programmatically modified in the extended JDBC driver to reflect the privacy rules stored as meta-information, opt-in/opt-out choices made by the users, and any higher-order constraints. Thus, the augmented query is able to resolve conditional access privileges at the level of the individual cell in a data record without requiring modification of enterprise applications.

We validated the prototype by implementing an enforcement scenario from the Health Insurance Portability and Accountability (HIPAA) act for a healthcare provider. Our extensive experiments show that the performance overhead of privacy enforcement is small and scalable, and often times the overhead is more than offset by the performance gains obtained through tuple filtering.  This is the first high-performance implementation of cell-level access control.

Privacy Preserving Analytics (Mining & OLAP)

Conventional wisdom says that data mining and privacy are adversaries, and the use of data mining must be restricted to protect privacy.  We resolved this dilemma by inventing privacy preserving data mining that exploits the difference between the level at which we care about privacy, i.e., individual data, and the level at which we run data mining algorithms, i.e., aggregated data.  User data is randomized such that it is impossible to recover anything meaningful at the individual level, while still allowing our novel algorithms to recover aggregate information, build mining models and provide actionable business insights.

The computation of multidimensional aggregates is the essence of on-line analytical processing (OLAP) used in business intelligence systems.  The goal of this work is to devise techniques for accurately computing data cube aggregates on data partitioned across multiple clients, while preserving the privacy of individual records. Extending the randomization approach to privacy we invented for privacy preserving data mining, we make every client perturb its recordbefore sending it to the server. The randomness used in perturbing the values ensures information-theoretic record-level privacy.  The serverruns queries on the resultant perturbed table. The query meant for the original tableis translated into a set of queries on the perturbed table. The answers to these queries are then combined and reconstructed to obtain the result to the original query with bounded error. We show that our techniques are safe against privacy breaches. In the process, we introduce the formalism for reconstructible functions on a perturbed table, and develop algorithms to reconstruct distributions for multiple attributes together. These techniques also apply for database tables in which some of the attributes are categorical. They are also applicable in settings in which the database tables are partitioned horizontally or vertically. 

Auditing Compliance

New legislation such as HIPAA requires enterprises to have the ability to retroactively demonstrate that their data management practices have complied with the declared policies. To support this requirement, new capabilities are needed in database systems to quickly and precisely identify all the queries (if any) that accessed the data of interest during a certain time-period in the past.  At the same time, this new functionality should not disrupt ongoing data operations in the enterprise.  Logging all the data accessed by individual queries is simply impractical both in terms of impact on query performance and on log storage costs.

We invented a novel audit system to address these challenging requirements.  Rather than logging records accessed by a query, our system logs only the query string and the rest of the query processing takes place unencumbered.  In an offline step, the database recovery log is used to create backlog tables that allow views to be defined to logically reconstruct any past state of the database.  Thus, the system puts minimal overhead on normal query processing.

To perform an audit, the auditor formulates an expression specifying the data whose disclosure needs to tracked. Audit expressions are designed to be identical to SQL queries, allowing audits to be performed at the level of an individual cell of a table. The audit expression is processed by the audit query generator, which first performs a static analysis of the expression to select candidate queries that could potentially disclose the specified data. It then combines and transforms the selected queries into an audit query by augmenting them with additional predicates derived from the audit expression. This audit query (expressed in standard SQL), when run against the backlog database, yields the precise set of queries that accessed the designated data.  Data structures designed for the backlog tables make the execution of the audit query fast.  The correctness of the audit is based on the formalization of the intuition that the suspicious query and the audit expression must share an indispensable record whose omission will affect the result of the query.

Sovereign Information Sharing

We have developed a radically new paradigm, called sovereign information sharing, to fulfill the requirements for integrating information across autonomous entities, arising out of trends such as end-to-end integration, outsourcing, and security.  Current information integration approaches, as exemplified by data warehouses and mediator-based data federations, are inadequate for this purpose as they are based on the assumption that the data in each database can be revealed completely to the other databases.  In our model, the execution of a query over two different databases does not reveal any information apart from the result of the query to either entity.  The applications for this model include information exchange between security agencies, intellectual property licensing, crime prevention and medical research.

We defined the basic operations in this model, developed algorithms for their execution, proved their correctness, and analyzed their performance.  Our analysis shows that these algorithms are more than 100 times faster than algorithms based on ideas in the current security literature (Yao’s combinatorial circuit).  We also built a prototype implementation of the model as a grid of data services.  Each entity participating in sovereign sharing offers a web service that allows sovereign operations to be applied to the data they own.  The data resides in DB2, the web services run on top of the WebSphere Application Server, the metadata is described in a UDDI registry, and Apache SOAP is used for communication.

Privacy Enforcement in XML Repositories

Given the increasing importance of XML data, we extended our Active Enforcement concepts for relational data to XML repositories.  First, we designed a privacy policy language, inspired by W3C’s P3P and endowed with features to specify policies for tree-structured content. An interesting aspect of this language is the handling of policy exceptions when the disclosures permissible at a node differ from those of the ancestors.  The policy annotator component of this enforcement architecture parses a policy and generates annotations in the form of XML fragments, which are then stored in the repository. At query time, the query rewriter accepts an application context and a XPath query and rewrites it to incorporate policy constraints and individual choices. The result of the rewritten query is the set of data sub-trees rooted at the XPath target elements with private information filtered out.  

Optimal Data Anonymization

We developed optimal anonymization algorithms for protecting the privacy of individuals while supporting external release of data, based on a powerful technique known as k-anonymization. Other approaches to de-identifying data are prone to linkage attacks that combine the subject data with other publicly available information to re-identify represented individuals.  For example, consider a data set of patient records that has been scrubbed of any personal identifiers such as name or social security number. While no record contains any single identifying value, many records are likely to contain unique value combinations. A person who is the only male born in 1960 living in some sparsely populated area could have his age, gender, and zip code joined with a voter registry from the area to obtain his name, thereby revealing his medical history. A k-anonymized data set ensures that each record is indistinguishable from at least k-1 other records within the data set. The larger the value of k, the greater the implied privacy since no individual can be identified with probability exceeding 1/k through linking alone.

Unfortunately, the problem of optimally k-anonymizing a dataset is NP-hard.  Consequently, previous algorithms for the problem all resort to approximation or incomplete heuristic search.  Our algorithm finds optimal k-anonymization on practical real-life data sets.

Order Preserving Encryption

Encryption is a well established technology for protecting sensitive data. However, once encrypted, data can no longer be easily queried aside from exact matches. We designed an order-preserving encryption scheme for numeric data that allows any comparison operation to be directly applied on encrypted data. Query results produced are sound (no false hits) and complete (no false drops). Our scheme handles updates gracefully and new values can be added without requiring changes in the encryption of other values. It allows standard database indexes to be built over encrypted tables and can easily be integrated with existing database systems. The proposed scheme was designed to be deployed in application environments in which the intruder can get access to the encrypted database, but does not have prior domain information such as the distribution of values and cannot encrypt or decrypt arbitrary values of his choice. The encryption is robust against estimation of the true value in such environments.

SQL Extension for Enforcing Privacy Policies

We designed constructs for imbuing relational database systems with fine-grained access control (FGAC) and show how they can be used to enforce disclosure control outlined in the vision for Hippocratic databases. These constructs were designed to be integrated with the rest of the infrastructure of a relational database system. We also developed the implementation of the proposed FGAC constructs and developed techniques for algorithmically  translating  privacy policies written in a higher-level specification language such as P3P  into the proposed constructs.

RFID Traceability

The EPCglobal Architecture Framework describes the EPC network consisting of EPCglobal core services and EPCglobal subscribers. In the network, each party provides EPC Information Service (EPCIS). The EPCIS Locator component represents an abstraction of the EPCglobal core services such as Object Name Service (ONS) and Discovery Service (DS). While such a network resembles distributed databases or peer-to-peer systems, there are a number of notable differences between traditional distributed or peer-to-peer databases and the EPC network:

  • There is no central database management system (DBMS) in the EPC network. Typical distributed database systems have a central administrative node that maintains information about the other database nodes, such as nodes that are currently online, and the fragmentation of data across different nodes. Such information is not present anywhere within the EPC network.
  • The EPC network can consist of hierarchies of EPC nodes. For instance, an EPC network might exist within an enterprise with restricted access (similar to a corporate intranet) which, to the external, publicly accessible EPC network, will appear as one EPC node. Such hierarchies are not common in distributed or peer-to-peer systems.
  • Data fragmentation can be arbitrarily complex within the EPC network. In distributed database systems, fragmentation is typically done at the column or row level. In the EPC network, however, a node might know some subset of information about a specific EPC, and a different subset of information about another EPC. Also, unlike distributed database or peer-to-peer systems, no global schema exists in the EPC network.
  • Querying on EPC data can involve query operators that are difficult to implement. For example, a query that requests containment information inside an EPC will make use of a distributed recursive operator, or a distributed join across multiple databases, both of which are not easy to design given the above issues.
Any querying system that leverages the EPC network will need to address these issues in its design. Our work focuses on developing an EPCIS querying system that will meet the following goals:
  • Provide a flexible and extensible query engine architecture for handling a variety of EPC queries in an efficient manner, such as lineage (pedigree / recall), data mining, and others while leveraging query processing power of standard relational databases;
  • Allow efficient querying of EPC data across multiple enterprises while respecting the sovereignty of individual parties;
  • Connect EPC data to other existing datasources at the enterprise, such as point of sale, inventory, and product referential data in support of querying, datamining, and other data management activities.
Our research has led to the development of Theseos technology.

Data Pointillism

Commercial enterprises generate a flood of business transactions involving millions of individual people, products and other physical entities every day. Collecting as much background knowledge as possible about any given entity is essential for making informed and timely decisions, as well as for quick identification of all entities relevant to a query. Entity resolution, the problem of identifying and grouping together all attribute values and transaction records that belong to the same entity, has attracted a great deal of attention in the data mining community.

Entity resolution accumulates dynamic entity images from the stream of records it observes and links each entity image to all records associated with the corresponding entity. This could be easy if each record had an attribute that uniquely described the entity (a key); in reality, however, attribute values are shared across multiple entities, while records can be outdated, incomplete, ambiguous, contain errors or even be intentionally wrong. So, entity resolution presents serious technical challenges.

One popular approach to entity resolution is based on repeated matching and merging of transactions or partial entity images by applying a set of matching rules. We are evaluating the accuracy of this algorithm, designing methods for automatic selection of high-quality matching rulesets given a training entity-labeled dataset and studying ways to achieve higher performance in novel application domains. We are also developing mathematical foundations for entity resolution based on statistics, machine learning and data mining theory, and combinatorics. In these studies we hope to gain insights that may apply in a wider setting of "data pointillism", the formation of complex data models by coherently combining many small fragments of evidence. Our research also extends to analyzing and developing security and privacy protection mechanisms for entity resolution.


    About IBMPrivacyContact