Week 9 – The road ahead

As mentioned last week, we’d like to develop our own tool to capture the provenance information of digital notebooks, so writing an IPython extension is a good choice. An IPython extension is like a user-defined library: the APIs it defines can be registered so that they can be invoked in the notebook as functions. Following the same logic as the NoWorkflow system, we developed the CellProv IPython extension, which captures provenance information from an IPython notebook, specifically the runtime provenance showing how data is transformed, how code is executed, and so on. The provenance information of each cell is captured as the inputs and outputs of that cell, but functions can be invoked many times in different cells, so the provenance information of function invocations needs to be recorded separately. Two different types of provenance information are captured:

Cell provenance: if a cell uses variables and functions that have been defined previously, then those variables and functions are treated as the input of this cell. If new variables or functions are defined in this cell, those are the output of the cell.

Function provenance: for each function, the arguments and return value are recorded at each invocation of the function.
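To make the two kinds of provenance concrete, here is a minimal sketch in plain Python. It is not the CellProv implementation: the names cell_io, record_calls and FUNC_PROV are hypothetical, and the cell analysis only looks at top-level assignments and definitions (imports, loops and other binding forms are ignored).

```python
import ast
import functools


def cell_io(source, known_names):
    """Return (inputs, outputs) for one cell's source code.

    Hypothetical helper, not part of CellProv.
    inputs:  names the cell reads that earlier cells defined.
    outputs: names bound by the cell's top-level assignments and definitions.
    """
    tree = ast.parse(source)

    # Inputs: every name read anywhere in the cell that is already known.
    loaded = {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
    }
    inputs = loaded & set(known_names)

    # Outputs: only top-level statements are inspected (a simplification).
    outputs = set()
    for stmt in tree.body:
        if isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            outputs.add(stmt.name)
        elif isinstance(stmt, (ast.Assign, ast.AugAssign, ast.AnnAssign)):
            targets = stmt.targets if isinstance(stmt, ast.Assign) else [stmt.target]
            for target in targets:
                for node in ast.walk(target):
                    if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
                        outputs.add(node.id)
    return inputs, outputs


# Function provenance: one list of call records per function name.
FUNC_PROV = {}


def record_calls(func):
    """Decorator that records arguments and return value of every call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        FUNC_PROV.setdefault(func.__name__, []).append(
            {"args": args, "kwargs": kwargs, "result": result}
        )
        return result
    return wrapper


# Usage: two "cells" executed in order.
defined_so_far = set()
inputs, outputs = cell_io("x = 1\ndef double(v):\n    return 2 * v", defined_so_far)
defined_so_far |= outputs
inputs, outputs = cell_io("y = double(x)", defined_so_far)
print(sorted(inputs), sorted(outputs))  # ['double', 'x'] ['y']
```

Keeping the function records in their own table, keyed by function name, reflects the point above: one function can be called from many cells, so its invocations cannot be attached to a single cell.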

Here are some of the registered functions that can be used in a notebook:

view_prov(type): view the captured provenance information; type can be “all”, “cell”, or “func”.
view_cell(name) and view_func(name): view the provenance of a specific cell or function.
save_file(file): save the output of view_prov(“all”) to a file.
construct(): construct the provenance information.

And of course, the “%load_ext cellProv.extension” command needs to be run at the beginning of a notebook to enable the CellProv extension.
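For readers who have not written an extension before, the following sketch shows roughly what %load_ext relies on: IPython imports the named module and calls its load_ipython_extension function with the running shell. The hook name, ipython.events.register and ipython.push are IPython's own API; everything else (the _records list, the simplified view_prov that only returns cell sources) is a hypothetical stand-in and not CellProv's actual code.

```python
# Hypothetical extension module, e.g. saved as my_prov/extension.py

_records = []  # in-memory store: source text of each executed cell


def view_prov(kind="all"):
    """Simplified stand-in: return the recorded cell sources.

    The real view_prov distinguishes "all", "cell" and "func";
    this sketch only tracks cells, so `kind` is accepted but unused.
    """
    return list(_records)


def _post_run_cell(result):
    """IPython calls this after each cell with an execution result object."""
    info = getattr(result, "info", None)
    raw_cell = getattr(info, "raw_cell", None)
    if raw_cell is not None:
        _records.append(raw_cell)


def load_ipython_extension(ipython):
    """Entry point that IPython calls when the extension is loaded."""
    # Run our callback after every executed cell.
    ipython.events.register("post_run_cell", _post_run_cell)
    # Make the helper callable from notebook cells as a plain function.
    ipython.push({"view_prov": view_prov})
```

With a module like this on the import path, running %load_ext my_prov.extension in the first cell would make view_prov() available in every later cell.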

The source code can be found in the BitBucket repository, and a presentation summarizing this summer project, including the tools, can be found here.

There is some future work and room for improvement:

1. The provenance information is currently stored in memory, and the information from different trials cannot be aggregated. A better solution is to store it in a database, but we need to consider the balance between space and performance as well.

2. Depending on the user’s requirements, different provenance information may be needed. For now, only a small part of it is captured; it would be a good idea to extend this in the future.

3. Maintenance and improvement should continue.
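As an illustration of the first item, here is one way cell provenance could be persisted so that several trials can be queried together. This is only a sketch under our own assumptions: it uses SQLite from the Python standard library, and the table layout and function names (open_store, save_cell, trials_defining) are hypothetical, not an existing part of CellProv.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the provenance database at `path`."""
    conn = sqlite3.connect(path)
    # inputs and outputs are stored as JSON-encoded lists of names.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS cell_prov (
               trial_id  INTEGER NOT NULL,
               cell_name TEXT    NOT NULL,
               inputs    TEXT    NOT NULL,
               outputs   TEXT    NOT NULL
           )"""
    )
    return conn


def save_cell(conn, trial_id, cell_name, inputs, outputs):
    """Store one cell's provenance for one trial."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO cell_prov VALUES (?, ?, ?, ?)",
            (trial_id, cell_name,
             json.dumps(sorted(inputs)), json.dumps(sorted(outputs))),
        )


def trials_defining(conn, name):
    """Return the sorted ids of all trials in which some cell defined `name`."""
    rows = conn.execute("SELECT trial_id, outputs FROM cell_prov").fetchall()
    return sorted({trial for trial, outputs in rows if name in json.loads(outputs)})


# Usage: the same cell recorded in two trials.
conn = open_store()
save_cell(conn, 1, "cell_1", set(), {"x"})
save_cell(conn, 2, "cell_1", set(), {"x", "y"})
print(trials_defining(conn, "y"))  # [2]
```

Storing names as JSON text keeps the schema small but pushes filtering into Python; a normalized table with one row per name would use more space and make such queries faster, which is the kind of space and performance trade-off mentioned above.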

We’ll keep working on the tool and testing it on different use cases.
