{"id":2425,"date":"2014-08-09T18:15:35","date_gmt":"2014-08-09T18:15:35","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=2425"},"modified":"2014-08-09T18:15:35","modified_gmt":"2014-08-09T18:15:35","slug":"week-7-provenance-questions","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/prov-notebooks\/week-7-provenance-questions\/","title":{"rendered":"Week 7 – Provenance questions"},"content":{"rendered":"
Let’s continue capturing provenance information for the Python script that visualizes a NetCDF file, using the noWorkflow system.<\/p>\n
In previous posts, I mentioned that noWorkflow provenance comes in three types: definition, deployment and execution. For execution provenance, only function activations and file accesses are captured; there is very little provenance for the data itself, so it is not easy to track data dependencies, data lineage and so on. Although we can specify the depth of function calls that noWorkflow captures, some of the captured functions are not defined by the user, so the result may be overkill, with more information captured than we need. Nevertheless, we can still ask some provenance questions based on the information that has been captured.<\/p>\n
The provenance information is stored in a local SQLite database, and part of it can be exported as Prolog rules. We can then query the exported rules with Datalog, for example, which handles recursion well. The queries below are therefore a mix of SQL and Datalog statements; of course, we could modify noWorkflow to export more information as Prolog rules. Here is a list of questions that can be answered based on the provenance information noWorkflow captures:<\/p>\n
(1) Given a function (CALLEE), which functions call it, directly or indirectly? (2) Which function reads the land mask data (FILE), and which function reads the simulation file (FILE)? (3) Which functions are never called during the execution of a given trial (TRIAL)? (4) Which function runs the longest? (5) Given a function activation (FUNCTION_ID), what are its argument values and return value? This is a bottom-up approach: we start from the information we have and work out which questions it can answer. The alternative is a top-down approach from the user’s point of view: we collect the questions users want to ask and check whether enough information has been captured to answer them. So next, I’ll spend some time working with domain scientists to see whether they have particular questions that can be answered with the provenance information we have.<\/p>\n Next week, I’ll put the script in an IPython notebook, use cells to split the script into separate processing units, and see how we can capture provenance at the cell level.<\/p>\n","protected":false},"excerpt":{"rendered":" Let’s continue on the front of capturing provenance information on\u00a0the Python script that visualizes a NetCDF file, using NoWorkflow system. In previous blogs, I mentioned\u00a0that NoWorkflow provenance has three types: definition, development and execution. For execution provenance, only function activation and file access information are captured, there is not very Continue reading Week 7 – Provenance questions<\/span>
\n(direct call)<\/em> ans(Caller) :- activation_id(Caller, CALLEE).
\n(indirect call)<\/em> ans(Caller) :- indirect_activation(Caller, CALLEE).<\/p>\n
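The indirect-call query relies on recursion: a caller of a caller is also an indirect caller. As a rough illustration of what that rule computes, here is a minimal Python sketch of the same transitive closure. The (caller, callee) pairs and the function names are hypothetical sample data, not output from noWorkflow.

```python
from collections import defaultdict


def indirect_callers(calls, callee):
    """Return every function that reaches `callee` through one or more calls."""
    # Index the call pairs by callee so we can walk the call graph upwards.
    callers_of = defaultdict(set)
    for caller, called in calls:
        callers_of[called].add(caller)

    seen = set()
    stack = [callee]
    while stack:
        current = stack.pop()
        for caller in callers_of[current]:
            if caller not in seen:  # guards against cycles (recursive functions)
                seen.add(caller)
                stack.append(caller)
    return seen


# Hypothetical (caller, callee) pairs for illustration only.
calls = [
    ('main', 'plot_map'),
    ('plot_map', 'read_mask'),
    ('main', 'read_simulation'),
]

direct = {caller for caller, called in calls if called == 'read_mask'}
indirect = indirect_callers(calls, 'read_mask')
```

Here `direct` holds only the immediate caller, while `indirect` also includes the functions further up the call chain, which is the distinction the two Datalog rules draw.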
\nans(Function) :- access(FILE, _, 'r', _, _, _, Function).<\/p>\n
\nSELECT name FROM function_def WHERE trial_id = TRIAL AND name NOT IN (SELECT name FROM function_activation WHERE trial_id = TRIAL)<\/p>\n
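The SQL query above can be run from Python with the standard sqlite3 module. The sketch below is only an illustration: it builds a tiny in-memory database whose two tables keep just the columns the query uses (an assumption, the real noWorkflow schema has more columns), and the function names in the sample rows are made up.

```python
import sqlite3


def uncalled_functions(conn, trial_id):
    """Names defined in a trial that never appear among its activations."""
    query = (
        'SELECT name FROM function_def '
        'WHERE trial_id = ? AND name NOT IN '
        '(SELECT name FROM function_activation WHERE trial_id = ?) '
        'ORDER BY name'
    )
    return [row[0] for row in conn.execute(query, (trial_id, trial_id))]


# Simplified, hypothetical tables and rows for illustration only.
conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE function_def (trial_id INTEGER, name TEXT);
    CREATE TABLE function_activation (trial_id INTEGER, name TEXT);
    INSERT INTO function_def VALUES
        (1, 'read_mask'), (1, 'plot_map'), (1, 'unused_helper');
    INSERT INTO function_activation VALUES
        (1, 'read_mask'), (1, 'plot_map');
''')

unused = uncalled_functions(conn, 1)
```

Passing the trial id as a bound parameter avoids editing the query text for each trial. One caveat of `NOT IN`: if the subquery ever returns a NULL name, the outer query returns no rows at all.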
\nshort(Function) :- duration_id(Function, D1), duration_id(_, D2), D2 > D1.
\nans(Function) :- duration_id(Function, _), not short(Function).<\/p>\n
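The two rules find the longest-running function by negation: a function is "short" if some other activation lasted longer, and the answer is every function that is not short. A minimal Python sketch of the same idea follows, assuming the durations are already available as (function, duration) pairs; the names and numbers are hypothetical.

```python
def longest_running(durations):
    """Return the function(s) whose duration no other function exceeds."""
    if not durations:
        return []
    longest = max(duration for _, duration in durations)
    # Keep ties, as the Datalog rule does: every function that is not "short".
    return sorted(name for name, duration in durations if duration == longest)


# Hypothetical (function, duration in seconds) pairs for illustration only.
durations = [('read_mask', 0.4), ('read_simulation', 2.5), ('plot_map', 1.1)]

slowest = longest_running(durations)
```

Like the Datalog version, this returns several functions when they tie for the longest duration.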
\n(Argument)<\/em> SELECT name, value FROM object_value WHERE type = 'ARGUMENT' AND function_activation_id = FUNCTION_ID
\n(Return)<\/em> SELECT return FROM function_activation WHERE id = FUNCTION_ID<\/p>\n