Week 7 – Provenance questions

Let’s continue capturing provenance information for the Python script that visualizes a NetCDF file, using the noWorkflow system.

In previous posts, I mentioned that noWorkflow captures three types of provenance: definition, deployment and execution. For execution provenance, only function activations and file accesses are captured; there is very little provenance for the data itself, so it is not easy to track data dependencies, data lineage and so on. Although we can specify the depth of function calls that noWorkflow captures, some of the captured functions are not defined by the user, so the result can be overkill: more information than we actually need. Nevertheless, we can still ask some provenance questions based on the information that has been captured.

The provenance information is stored in a local SQLite database, and part of it can be exported to Prolog rules. We can then query it with Datalog, for example, which handles recursion well. The queries below are therefore a mix of SQL and Datalog statements; of course, we could modify noWorkflow to export more information to Prolog rules. Here is a list of questions that can be answered from the provenance information noWorkflow captures:

Given a function (CALLEE), which functions call it?
(direct call) ans(Caller) :- activation_id(Caller, CALLEE).
(indirect call) ans(Caller) :- indirect_activation(Caller, CALLEE).
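
To illustrate what the indirect-call query computes, here is a minimal Python sketch of the same idea: a transitive closure over direct caller/callee pairs. The function name `indirect_callers` and the sample call pairs are made up for illustration; they are not part of noWorkflow, and this sketch counts direct callers as well as callers further up the chain.

```python
from collections import defaultdict


def indirect_callers(direct_calls, callee):
    """Return every function that reaches `callee` through one or more calls."""
    # Index the direct-call pairs by callee so we can walk upwards.
    callers_of = defaultdict(set)
    for caller, called in direct_calls:
        callers_of[called].add(caller)

    found = set()
    stack = [callee]
    while stack:
        current = stack.pop()
        for caller in callers_of[current]:
            if caller not in found:  # also guards against recursive calls
                found.add(caller)
                stack.append(caller)
    return found


# Hypothetical (caller, callee) pairs; not taken from a real trial.
calls = [("main", "plot"), ("plot", "read_file"), ("main", "report")]
print(sorted(indirect_callers(calls, "read_file")))
```

This is exactly the kind of recursion that Datalog expresses in two rules, which is why exporting to Prolog is convenient here.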

Which function reads the land mask data (FILE), and which function reads the simulation file (FILE)?
ans(Function) :- access(FILE, _, "r", _, _, _, Function).

Which functions are not called during execution (for a given trial)?
SELECT name FROM function_def WHERE trial_id = TRIAL AND name NOT IN (SELECT name FROM function_activation WHERE trial_id = TRIAL)
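
As a rough sketch of how this query could be run from Python, the example below uses the standard `sqlite3` module against an in-memory database. The two tables are heavily simplified stand-ins that only keep the `trial_id` and `name` columns used in the query; the real noWorkflow schema has more columns, and the sample rows and the helper `uncalled_functions` are invented for illustration.

```python
import sqlite3


def uncalled_functions(conn, trial_id):
    """List functions defined in a trial that were never activated in it."""
    query = """
        SELECT name FROM function_def
        WHERE trial_id = ?
          AND name NOT IN (
              SELECT name FROM function_activation WHERE trial_id = ?
          )
        ORDER BY name
    """
    return [row[0] for row in conn.execute(query, (trial_id, trial_id))]


# Simplified, hypothetical tables and rows (not the full noWorkflow schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE function_def (trial_id INTEGER, name TEXT);
    CREATE TABLE function_activation (trial_id INTEGER, name TEXT);
    INSERT INTO function_def VALUES (1, 'load'), (1, 'plot'), (1, 'debug_dump');
    INSERT INTO function_activation VALUES (1, 'load'), (1, 'plot'), (2, 'debug_dump');
""")

print(uncalled_functions(conn, 1))
```

Passing the trial id as a bound parameter (`?`) rather than pasting it into the SQL string keeps the query reusable across trials.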

Which function runs the longest?
short(Function) :- duration_id(Function, D1), duration_id(_, D2), D2 > D1.
ans(Function) :- duration_id(Function, _), not short(Function).
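
The rules above say: a function is "short" if some other activation lasted longer, and the answer is whatever is not short. A small Python sketch of the same logic, assuming the durations have already been loaded into a dictionary (the function names and timings below are invented):

```python
def longest_running(durations):
    """Return the names whose duration is not exceeded by any other."""
    if not durations:
        return []
    longest = max(durations.values())
    # Like the Datalog version, ties are all reported.
    return sorted(name for name, d in durations.items() if d == longest)


# Hypothetical durations in milliseconds; not measured from a real trial.
durations = {"load": 120, "plot": 950, "save": 950, "report": 40}
print(longest_running(durations))
```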

Given a function activation (FUNCTION_ID), what are its argument values and return value?
(Argument) SELECT name, value FROM object_value WHERE type = 'ARGUMENT' AND function_activation_id = FUNCTION_ID
(Return) SELECT return FROM function_activation WHERE id = FUNCTION_ID

We can think of this as the bottom-up approach: we try to find all the questions that can be answered from the information we already have. The alternative is a top-down approach that starts from the user’s point of view: we collect the questions users want to ask and check whether enough information has been captured to answer them. So next, I’ll spend some time working with domain scientists to see whether they have particular questions that can be answered with the provenance information we have.

Next week, I’ll put the script in an IPython notebook, use cells to split it into separate processing units, and see how we can capture provenance at the cell level.
