Week 9 – The road ahead – DataONE Notebooks

As mentioned last week, we would like to develop our own tool to capture the provenance of digital notebooks, and writing an IPython extension is a good way to do that. An IPython extension works like a user-defined library: the functions it defines can be registered so that they can be invoked directly in the notebook. Following the same logic as the NoWorkflow system, we developed the CellProv IPython extension, which captures provenance information from an IPython notebook, specifically the runtime provenance showing how data is transformed and how code is executed. The provenance of each cell is captured as the inputs and outputs of that cell. Because a function can be invoked many times in different cells, the provenance of function invocations is recorded separately. Two types of provenance information are therefore captured:

Cell provenance: If a cell uses variables or functions that were defined previously, those variables and functions are treated as the inputs of the cell. Any new variables or functions defined in the cell are its outputs.

Function provenance: For each function, the arguments and the return value are recorded on every invocation.
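The two kinds of provenance above can be illustrated with a minimal sketch. This is not CellProv's actual implementation: the names cell_io, record_calls and call_log are hypothetical, and the approach (static analysis with Python's ast module plus a wrapping decorator) is only one assumed way to get the same kind of information.

```python
import ast
import functools


def cell_io(source, known_names):
    """Return (inputs, outputs) for the source code of one cell.

    inputs:  names the cell reads that earlier cells already defined.
    outputs: names the cell assigns or defines.
    Simplification: names assigned inside a function body are also
    counted as outputs, which a real tool would want to filter out.
    """
    loaded, defined = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                loaded.add(node.id)
            elif isinstance(node.ctx, ast.Store):
                defined.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
    return loaded & set(known_names), defined


# Function provenance: one record per invocation.
call_log = []


def record_calls(func):
    """Wrap func so that each call appends its arguments and result to call_log."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        call_log.append({
            "function": func.__name__,
            "args": args,
            "kwargs": kwargs,
            "result": result,
        })
        return result
    return wrapper
```

With this sketch, a cell containing y = double(x) that runs after x and double were defined would have x and double as inputs and y as its output, and every call to a function decorated with record_calls adds one entry to call_log.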

Here are some of the registered functions that can be used in a notebook:

view_prov(type): view the captured provenance information; type can be "all", "cell", or "func".
view_cell(name) and view_func(name): view the provenance of a specific cell or function.
save_file(file): save the output of view_prov("all") to a file.
construct(): construct the provenance information.

To enable the CellProv extension, the command %load_ext cellProv.extension needs to be run at the beginning of the notebook.
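For readers unfamiliar with how %load_ext works, here is a minimal sketch of an extension module. It is not the CellProv source: cell_history and view_cells are hypothetical names, and the sketch only records the raw text of each executed cell. What it relies on from IPython is the load_ipython_extension entry point, the post_run_cell event (which in recent IPython versions passes a result object whose info.raw_cell holds the cell source), and the shell's push method for placing names in the notebook namespace.

```python
# Raw source of every cell executed since the extension was loaded.
cell_history = []


def _record_cell(result):
    # Called by IPython after each cell runs; result.info.raw_cell is
    # the text of the cell that was just executed.
    cell_history.append(result.info.raw_cell)


def view_cells():
    """Return a copy of the recorded cell sources (hypothetical helper)."""
    return list(cell_history)


def load_ipython_extension(ipython):
    # IPython calls this function when the user runs %load_ext <module>.
    ipython.events.register("post_run_cell", _record_cell)
    # Make the helper callable from notebook cells as a plain function.
    ipython.push({"view_cells": view_cells})


def unload_ipython_extension(ipython):
    # IPython calls this function on %unload_ext <module>.
    ipython.events.unregister("post_run_cell", _record_cell)
```

If this were saved as an importable module, loading it with %load_ext would start the recording, and view_cells() could then be called from any later cell.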

The source code can be found in the BitBucket repository, and a presentation summarizing this summer project, including the tools, can be found here.

There is still room for future work and improvement:

1. The provenance information is currently stored in memory, so information from different trials cannot be aggregated. A better solution is to store it in a database, but we need to consider the balance between storage space and performance as well.

2. Depending on the user’s requirements, different provenance information may be needed. For now, only a small part of it is captured, so extending what is captured is a good idea for the future.

3. Maintenance and improvement of the tool should continue.
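As a rough illustration of the first item, the sketch below stores function provenance in SQLite so that records from several trials can be queried together. Nothing here exists in CellProv yet: the table layout and the names open_store, record_call and calls_across_trials are assumptions, and arguments and results are serialized as JSON, which only works for JSON-compatible values.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a provenance store; the default is an in-memory database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS func_prov (
               trial     INTEGER NOT NULL,
               function  TEXT    NOT NULL,
               arguments TEXT    NOT NULL,
               result    TEXT    NOT NULL
           )"""
    )
    return conn


def record_call(conn, trial, function, arguments, result):
    """Store one function invocation for the given trial number."""
    conn.execute(
        "INSERT INTO func_prov VALUES (?, ?, ?, ?)",
        (trial, function, json.dumps(arguments), json.dumps(result)),
    )
    conn.commit()


def calls_across_trials(conn, function):
    """Return (trial, arguments, result) for every recorded call of a function."""
    rows = conn.execute(
        "SELECT trial, arguments, result FROM func_prov "
        "WHERE function = ? ORDER BY trial, rowid",
        (function,),
    )
    return [(trial, json.loads(args), json.loads(res)) for trial, args, res in rows]
```

Passing a file path instead of the default keeps the records between notebook sessions, at the cost of the extra disk space and write time mentioned in the first item.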

We’ll keep working on the tool and test it on different use cases.
