Data provenance has important applications in databases, from optimization, to verifying trustworthiness of answers, to helping programmers with debugging. LogicBlox is fortunate to employee two of the world’s experts in data provenance, Grigoris Karvounarakis and T.J. Green. Their Ph.D advisors, Prof. Zachary Ives and Prof. Val Tannen of UPenn, are active and leading researchers in this area. This latest work by Grigoris, Zach, and Val discuss a query language for analyzing the graph that results from provenance tracking.
Abstract Many advanced data management operations (e.g., incremental maintenance, trust assessment, debugging schema mappings, keyword search over databases, or query answering in probabilistic databases), involve computations that look at how a tuple was produced, e.g., to determine its score or existence. This requires answers to queries such as, “Is this data derivable from trusted tuples?”; “What tuples are derived from this relation?”; or “What score should this answer receive, given initial scores of the base tuples?”. Such questions can be answered by consulting the provenance of query results. In recent years there has been significant progress on formal models for provenance. However, the issues of provenance storage, maintenance, and querying have not yet been addressed in an application-independent way. In this paper, we adopt the most general formalism for tuple-based provenance, semiring provenance. We develop a query language for provenance, which can express all of the aforementioned types of queries, as well as many more; we propose storage, processing and indexing schemes for data provenance in support of these queries; and we experimentally validate the feasibility of provenance querying and the benefits of our indexing techniques across a variety of application classes and queries.