Last week, I attended the SSDBM 2011 conference in Portland, OR, USA, where I presented our work “PROPUB: Towards a Declarative Approach for Publishing Customized, Policy-Aware Provenance”. In this work, we show how to balance (i) the desire to publish provenance data so that collaborators can understand and rely on the shared data products against (ii) the need to protect sensitive information, e.g., due to privacy concerns or intellectual property issues. We view provenance as a bipartite, directed, acyclic graph that captures which data nodes were consumed and produced by invocation nodes (computations).
To protect privacy, a scientist can remove sensitive data or invocation nodes from the provenance graph. Alternatively, she can abstract a set of sensitive nodes by grouping them into a single abstract node. These updates may violate some of the integrity constraints of the provenance graph. For example, grouping multiple nodes into one abstract node may introduce new dependencies that are absent from the original provenance graph. Removing nodes may also make some nodes in the final graph appear independent of each other even though they are dependent in the original graph. As a result, one can no longer trust that the published provenance data is correct (e.g., that there are no false dependencies) or complete (e.g., that there are no false independencies). We propose a system that lets a publisher give a high-level specification of which parts of the provenance graph are to be published and which parts are to be sanitized, while guaranteeing that certain provenance publication constraints are observed.
To achieve this, we developed PROPUB (our Provenance Publisher), which allows the user (i) to state provenance publication and customization requests, (ii) to specify provenance policies that should be obeyed, (iii) to check whether the policies are satisfied, and (iv) to repair policy violations and reconcile conflicts between user requests and provenance policies should they occur.
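To make the false-dependency problem concrete, here is a minimal Python sketch. It is only an illustration and not PROPUB itself, which takes a declarative approach; the function names `reachable`, `abstract`, and `false_dependencies`, the edge-set representation, and the toy node names are all assumptions made for this example. Edges point in the direction of data flow (data node to the invocation that consumed it, invocation to the data node it produced).

```python
from collections import defaultdict


def reachable(edges, start):
    """Return every node reachable from `start` by following edges forward."""
    succ = defaultdict(set)
    for src, dst in edges:
        succ[src].add(dst)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in succ[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def abstract(edges, group, name):
    """Collapse all nodes in `group` into one abstract node called `name`."""
    def rename(node):
        return name if node in group else node
    # Edges that start and end inside the group would become self-loops; drop them.
    return {(rename(s), rename(d)) for s, d in edges if rename(s) != rename(d)}


def false_dependencies(original, published, abstract_node):
    """Pairs (a, b) where b depends on a in the published graph but not in the original."""
    kept = {n for edge in published for n in edge} - {abstract_node}
    return {
        (a, b)
        for a in kept
        for b in reachable(published, a) & kept
        if b not in reachable(original, a)
    }


# Toy provenance graph (hypothetical): two independent pipelines.
original = {("d1", "i1"), ("i1", "d3"), ("d2", "i2"), ("i2", "d4")}

# Hide both invocations behind a single abstract node "g".
published = abstract(original, {"i1", "i2"}, "g")

print(sorted(false_dependencies(original, published, "g")))
# prints [('d1', 'd4'), ('d2', 'd3')]
```

In the toy graph, `d4` never depended on `d1`, but once `i1` and `i2` are merged into `g`, the published graph suggests that it does. A publication policy that forbids false dependencies would flag exactly these pairs.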
Two other papers showed applications of provenance data:
(i) Sean Riddle presented their work on “Improving Workflow Fault Tolerance through Provenance-based Recovery”. He stated that scientific workflow systems are frequently used to execute a variety of long-running computational pipelines that are prone to premature termination due to network failures, server outages, and other faults. He mentioned that researchers have presented approaches for providing fault tolerance for portions of specific workflows, but no solution handles faults that terminate the workflow engine itself when executing a mix of stateless and stateful workflow components. In this work, they develop a general framework for efficiently resuming workflow execution using information commonly captured by workflow systems to record data provenance. This approach facilitates fast workflow replay using only such commonly recorded provenance data. They also propose a checkpoint extension to standard provenance models that significantly reduces the computation needed to reset the workflow to a consistent state, resulting in much shorter re-execution times.
(ii) James Frew presented their work on “Provenance-Enabled Automatic Data Publishing”. He stated that scientists are as likely to use one-off scripts, legacy programs, and volatile collections of data and parametric assumptions as they are to frame their investigations using easily reproducible workflows. He mentioned that the ES3 system can capture the provenance of such unstructured computations and make it available so that the results of such computations can be evaluated in the overall context of their inputs, implementation, and assumptions. He described a system that, given a request to publish a particular computational artifact, traverses that artifact’s provenance and applies rule-based tests (e.g., source code checked into a source control system, data accessible from a well-known repository, etc.) to each of the artifact’s computational antecedents to determine whether the artifact’s provenance is robust enough to justify its publication.
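The rule-based publication check from the second talk can be sketched roughly as follows. This is my own minimal Python illustration and not the ES3 implementation: the dictionary layout, the metadata keys (`kind`, `vcs_revision`, `repository_url`), the two rule functions, and the file names are all hypothetical.

```python
def antecedents(derived_from, artifact):
    """Return every artifact that `artifact` transitively depends on."""
    seen, stack = set(), [artifact]
    while stack:
        for parent in derived_from.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def in_source_control(meta):
    """Code must carry a revision identifier; non-code passes trivially."""
    return meta["kind"] != "code" or meta.get("vcs_revision") is not None


def in_known_repository(meta):
    """Data must be reachable from a repository; non-data passes trivially."""
    return meta["kind"] != "data" or meta.get("repository_url") is not None


# Hypothetical rule set: label -> test applied to each antecedent.
RULES = {
    "code is under source control": in_source_control,
    "data is in a known repository": in_known_repository,
}


def publication_failures(derived_from, metadata, artifact):
    """List (antecedent, rule label) pairs for every rule an antecedent fails."""
    failures = []
    for node in sorted(antecedents(derived_from, artifact)):
        for label, rule in RULES.items():
            if not rule(metadata[node]):
                failures.append((node, label))
    return failures


# Toy provenance (hypothetical): which inputs each artifact was derived from.
derived_from = {
    "plot.png": ["analyze.py", "clean.csv"],
    "clean.csv": ["clean.py", "raw.csv"],
}
metadata = {
    "analyze.py": {"kind": "code", "vcs_revision": "a1b2c3"},
    "clean.py": {"kind": "code"},  # never checked in
    "clean.csv": {"kind": "data", "repository_url": "https://example.org/clean.csv"},
    "raw.csv": {"kind": "data", "repository_url": "https://example.org/raw.csv"},
}

print(publication_failures(derived_from, metadata, "plot.png"))
# prints [('clean.py', 'code is under source control')]
```

An empty list would mean that every antecedent passed every rule, so under these assumed rules the artifact's provenance would be considered robust enough to publish.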
There were also discussions on the use of provenance data. Here are two of the questions that came up:
(1) Can provenance data be used to find out why and how two data artifacts differ?
(2) Can provenance data be used to find out whether a data artifact is corrupted?