This was the first week of my internship, so my main focus was to gain an understanding of the problem domain. A meeting with my mentors and some other members of the Scientific Workflows Provenance Working Group (ProvWG) helped me get familiar with some of their past projects. I also made a list of papers closely related to the PBase project and started reading through them.
Here is an overview of some of the key concepts:
— A workflow (Wf) is a structured composition of programs as a sequence of activities aimed at a specific result. Computational steps in a Wf are called “actors”. Thus, a Wf can be thought of as a collection of actors with data and control dependencies.
— A scientific workflow management system (SWfMS) controls the orchestration and execution of scientific Wfs and enables the automatic collection of Wf data products and their provenance.
— Provenance refers to historical information that shows how data products were generated and how they were transformed. It can be used, for example, to verify the quality of data and to document experiments.
— Two main types of provenance are prospective and retrospective. The former is represented by Wf specifications (modeled as a UML activity diagram), whereas the latter can be represented by an execution log.
— The provenance information can be gathered at three semantic levels: OS, Wf, and activity.
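To make the distinction between prospective and retrospective provenance concrete, here is a minimal Python sketch. It is only an illustration and is not taken from any SWfMS: the actor names (`load`, `clean`, `analyze`), the `run` function, and the log format are all hypothetical. The specification plays the role of prospective provenance, and the log produced by running it plays the role of retrospective provenance.

```python
# Prospective provenance: the workflow specification itself.
# Each (hypothetical) actor lists the actors whose output it consumes.
workflow_spec = {
    "load": [],
    "clean": ["load"],
    "analyze": ["clean"],
}

# Hypothetical actor implementations; each receives a list of input values.
actors = {
    "load": lambda inputs: [3, 1, 2],
    "clean": lambda inputs: sorted(inputs[0]),
    "analyze": lambda inputs: sum(inputs[0]),
}


def run(spec, actors):
    """Execute actors in dependency order and record an execution log."""
    outputs, log = {}, []
    pending = dict(spec)
    while pending:
        # An actor is ready once all of its dependencies have produced output.
        ready = [a for a, deps in pending.items()
                 if all(d in outputs for d in deps)]
        if not ready:
            raise ValueError("cycle or missing dependency in workflow spec")
        for actor in ready:
            inputs = [outputs[d] for d in pending[actor]]
            outputs[actor] = actors[actor](inputs)
            # Retrospective provenance: what actually happened in this run.
            log.append({"actor": actor,
                        "used": list(pending[actor]),
                        "generated": outputs[actor]})
            del pending[actor]
    return outputs, log


outputs, execution_log = run(workflow_spec, actors)
for entry in execution_log:
    print(entry)
```

The specification says what could happen in any run, while the log records the inputs used and the values generated in one particular run.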
The main goal of this summer project is to develop a provenance management architecture, called PBase, that integrates with the core DataONE architecture. PBase aims to support advanced query and analytic capabilities over the composition of provenance information produced by independent workflows that share some of their data.
One of the open problems is which provenance data should be gathered, and how. To address the first question, there is an effort toward a standard model, the Open Provenance Model (OPM). OPM uses a causality graph to capture the dependencies between agents, processes, and artifacts.
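As a rough illustration of such a causality graph, the Python sketch below stores typed nodes and checks that each edge connects the kinds of nodes OPM expects. The five edge names are the OPM dependency types as I understand them; the `OPMGraph` class and the example node names are hypothetical and are not part of any OPM library.

```python
# Allowed (source kind, target kind) for each OPM edge type.
# Edges point from the effect to its cause.
EDGE_TYPES = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasControlledBy": ("process", "agent"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
}


class OPMGraph:
    """Hypothetical in-memory causality graph with typed nodes and edges."""

    def __init__(self):
        self.kinds = {}   # node name -> "artifact" | "process" | "agent"
        self.edges = []   # (source, label, target) triples

    def add_node(self, name, kind):
        if kind not in {"artifact", "process", "agent"}:
            raise ValueError(f"unknown node kind: {kind}")
        self.kinds[name] = kind

    def add_edge(self, label, source, target):
        expected = EDGE_TYPES[label]
        actual = (self.kinds[source], self.kinds[target])
        if actual != expected:
            raise TypeError(f"{label} expects {expected}, got {actual}")
        self.edges.append((source, label, target))

    def causes(self, node):
        """Direct causes of a node, as (edge label, cause) pairs."""
        return [(label, t) for s, label, t in self.edges if s == node]


g = OPMGraph()
g.add_node("raw.csv", "artifact")
g.add_node("plot.png", "artifact")
g.add_node("render", "process")
g.add_node("alice", "agent")
g.add_edge("used", "render", "raw.csv")
g.add_edge("wasGeneratedBy", "plot.png", "render")
g.add_edge("wasControlledBy", "render", "alice")
print(g.causes("render"))
```

Checking node kinds at insertion time keeps malformed edges, such as an agent that "used" an artifact, out of the graph.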
In the Data Tree of Life project [MLB+10], an abstract model for Wfs and their associated provenance traces is used to enable provenance interoperability and integration. In particular, the authors show that provenance data in heterogeneous environments can be unified as a “virtual experiment” if the repositories used for sharing data can map different identifiers to the same dataset and the provenance traces can be mapped to a common model (in this case, OPM). The Golden Trail project [MLB+11] continues that effort. It aims to let scientists generate the provenance of a valuable result, which can be seen as a view on the virtual experiment and is technically a subset of a larger provenance graph.
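The two ideas in this paragraph can be sketched in a few lines of Python. This is my own simplified reading, not the implementation from either paper: the functions `unify` and `golden_trail`, the trace contents, and the identifier `shared:d1` are all hypothetical. Each trace is a set of (effect, cause) edges, and an identifier map says which local names refer to the same dataset.

```python
def unify(traces, same_as):
    """Merge traces into one edge set, rewriting local ids to shared ones."""
    def canon(node):
        return same_as.get(node, node)
    return {(canon(effect), canon(cause))
            for trace in traces
            for effect, cause in trace}


def golden_trail(edges, result):
    """Return the edges reachable from `result` by following effect -> cause."""
    by_effect = {}
    for effect, cause in edges:
        by_effect.setdefault(effect, set()).add(cause)
    trail, stack, seen = set(), [result], {result}
    while stack:
        node = stack.pop()
        for cause in by_effect.get(node, ()):
            trail.add((node, cause))
            if cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return trail


# Two independent runs; the output of run A is the input of run B,
# but each system knows that dataset under a different local identifier.
trace_a = {("A:out", "A:run"), ("A:run", "A:in")}
trace_b = {("B:plot", "B:run"), ("B:run", "B:data"), ("B:log", "B:run")}
same_as = {"A:out": "shared:d1", "B:data": "shared:d1"}

virtual_experiment = unify([trace_a, trace_b], same_as)
trail = golden_trail(virtual_experiment, "B:plot")
print(sorted(trail))
```

The trail for `B:plot` crosses from run B into run A through the shared dataset, and it leaves out `B:log` because the plot does not depend on it.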
Provenance management in distributed heterogeneous environments is still an open issue. In [MMW+12], a provenance management system for these environments, ProvManager, is introduced. ProvManager works at the activity level because this level is SWfMS-independent and allows the collection of more precise and complete provenance data than the OS-level strategy. The drawback of this level is that preexisting mechanisms must be automatically adapted to incorporate provenance-gathering functionality.
Two issues arise here: (1) Wf specifications may be written in different languages, and (2) each activity must somehow be adapted to gather provenance information. In [MMW+12], these issues are addressed by (1) using a specific adapter for each SWfMS, e.g. VisTrails, Kepler, and Taverna, and (2) performing the adaptation indirectly: the specification is modified by inserting new activities responsible for collecting provenance information and encapsulating them together with the original Wf activities.
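The indirect adaptation can be pictured with a small Python sketch. This is not ProvManager's API or its specification format: the list-of-activities representation and the `adapt` and `execute` functions are hypothetical. The sketch only shows the general idea of rewriting a specification so that each original activity is encapsulated together with a provenance-collecting step.

```python
def adapt(spec, store):
    """Return a new spec in which each activity also records provenance."""
    def wrap(name, activity):
        def wrapped(value):
            result = activity(value)
            # Inserted provenance-gathering step (hypothetical record format).
            store.append({"activity": name, "input": value, "output": result})
            return result
        return wrapped
    return [(name, wrap(name, activity)) for name, activity in spec]


def execute(spec, value):
    """Run a linear workflow by feeding each output into the next activity."""
    for _, activity in spec:
        value = activity(value)
    return value


# Original specification: a linear chain of (name, function) pairs.
original_spec = [("double", lambda x: x * 2), ("increment", lambda x: x + 1)]

provenance_store = []
adapted_spec = adapt(original_spec, provenance_store)
final_value = execute(adapted_spec, 5)
print(final_value, provenance_store)
```

The original activities are left untouched. Only the specification that is handed to the executor changes, which is what makes the approach independent of how each activity is implemented.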
Another related issue is querying provenance information. Regular path queries (RPQs) are a core graph query language for answering pattern-based reachability queries. Given a labeled graph G, an RPQ defined by a regular expression R returns the pairs of nodes (x, y) that are connected by a path in G whose concatenated edge labels match R. Traditional RPQs may not be sufficient for certain information needs. In [DCK+13], a variant of RPQs that uses the notion of provenance of RPQ results is introduced.
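One common way to evaluate a traditional RPQ is to search the product of the graph and an automaton for R. The Python sketch below assumes the regular expression has already been turned into a small NFA without epsilon moves. The function name `rpq`, the NFA encoding, and the example graph are hypothetical, and the sketch does not cover the provenance-aware variant from [DCK+13].

```python
from collections import deque


def rpq(edges, nfa_start, nfa_accept, nfa_delta):
    """Evaluate a regular path query over the graph x automaton product.

    edges: iterable of (source, label, target) triples
    nfa_delta: dict mapping (state, label) -> set of next states
    Returns the pairs (x, y) joined by a path whose labels the NFA accepts.
    """
    out, nodes = {}, set()
    for s, label, t in edges:
        out.setdefault(s, []).append((label, t))
        nodes.update((s, t))
    answers = set()
    for x in nodes:
        # Breadth-first search over (graph node, automaton state) pairs.
        seen = {(x, nfa_start)}
        queue = deque(seen)
        while queue:
            node, state = queue.popleft()
            if state in nfa_accept:
                answers.add((x, node))
            for label, nxt in out.get(node, ()):
                for nstate in nfa_delta.get((state, label), ()):
                    if (nxt, nstate) not in seen:
                        seen.add((nxt, nstate))
                        queue.append((nxt, nstate))
    return answers


# Example provenance graph with edges pointing from effect to cause.
edges = [
    ("plot", "wasGeneratedBy", "run2"),
    ("run2", "used", "clean"),
    ("clean", "wasGeneratedBy", "run1"),
    ("run1", "used", "raw"),
]

# NFA for R = (wasGeneratedBy used)+ : "artifact depends on artifact".
delta = {
    (0, "wasGeneratedBy"): {1},
    (1, "used"): {2},
    (2, "wasGeneratedBy"): {1},
}
answers = rpq(edges, nfa_start=0, nfa_accept={2}, nfa_delta=delta)
print(sorted(answers))
```

Because visited (node, state) pairs are tracked, the search terminates even when the graph contains cycles.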
Next week I will attend the ProvWG meeting at NYU Poly, where we will discuss some technical details of the project further and decide on next steps. It is also an opportunity for me to meet the project mentors and some other members of the DataONE ProvWG face to face.
References.
[MLB+10] Missier, Paolo, Bertram Ludäscher, Shawn Bowers, Manish Kumar Anand, Ilkay Altintas, Saumen Dey, Anandarup Sarkar, Biva Shrestha, and Carole Goble. “Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science.” In Procs. 5th Workshop on Workflows in Support of Large-Scale Science (WORKS), 2010.
[MLB+11] Missier, Paolo, Bertram Ludäscher, Shawn Bowers, Ilkay Altintas, Saumen Dey, and Michael Agun. “Golden Trail: Retrieving the Data History that Matters from a Comprehensive Provenance Repository.” In Procs. 7th International Digital Curation Conference, 2011.
[MMW+12] Marinho, Anderson, Leonardo Murta, Cláudia Werner, Vanessa Braganholo, Sérgio Manuel Serra da Cruz, Eduardo Ogasawara, and Marta Mattoso. “ProvManager: a provenance management system for scientific workflows.” Concurrency and Computation: Practice and Experience 24, no. 13 (2012): 1513-1530.