{"id":2509,"date":"2014-11-13T05:41:43","date_gmt":"2014-11-13T05:41:43","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=2509"},"modified":"2014-11-13T05:41:43","modified_gmt":"2014-11-13T05:41:43","slug":"week-9-the-road-ahead","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/prov-notebooks\/week-9-the-road-ahead\/","title":{"rendered":"Week 9 – The road ahead"},"content":{"rendered":"

As mentioned last week, we’d like to develop our own tool to capture provenance information from digital notebooks, and writing an IPython extension is a good way to do it. An IPython extension is like a user-defined library: the functions it defines can be registered so that they can be invoked directly in the notebook. Following the same logic as the NoWorkflow system, we developed the CellProv IPython extension, which captures provenance information from an IPython notebook, specifically the runtime provenance that shows how data is transformed and how code is executed. The provenance of each cell is captured as the inputs and outputs of that cell. Functions, however, can be invoked many times in different cells, so the provenance of function invocations needs to be recorded separately. Two types of provenance information are therefore captured:<\/p>\n

Cell provenance:<\/em> If a cell uses variables or functions that were defined previously, those variables and functions are treated as the inputs of the cell. Any new variables or functions defined in the cell are its outputs.<\/p>\n
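As a rough illustration of the cell provenance idea (not the actual CellProv code), the sketch below runs a cell's source in a shared namespace and reports which previously defined names it read and which names it created or rebound. The function name run_cell and the dictionary layout are assumptions made for this example.

```python
import ast


def run_cell(source, namespace):
    # Names the cell reads anywhere in its code (ast.Load context).
    loaded = {node.id for node in ast.walk(ast.parse(source))
              if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)}
    before = dict(namespace)  # snapshot of the names defined so far
    exec(source, namespace)   # run the cell in the shared namespace
    # Inputs: names the cell reads that already existed before it ran.
    inputs = sorted(name for name in loaded if name in before)
    # Outputs: names that are new or now bound to a different object.
    # Limitation: rebinding a name to the very same object is not detected.
    missing = object()
    outputs = sorted(name for name, value in namespace.items()
                     if not name.startswith('__')
                     and before.get(name, missing) is not value)
    return {'inputs': inputs, 'outputs': outputs}


cell_1 = '''
x = 3
def double(v):
    return 2 * v
'''
cell_2 = '''
y = double(x)
'''

shared = {}
first = run_cell(cell_1, shared)
second = run_cell(cell_2, shared)
print(first)
print(second)
```

Here the second cell lists x and double as inputs because both were defined by the first cell, and y as its only output.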

Function provenance:<\/em> For each function, the arguments and the return value are recorded on every invocation of that function.<\/p>\n
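One simple way to picture function provenance is a decorator that logs every call. This post does not say how CellProv hooks into function calls, so the decorator is only a stand-in for illustration, and the names FUNC_PROV, record_calls and scale are hypothetical.

```python
import functools

# Hypothetical in-memory store: function name -> list of invocation records.
FUNC_PROV = {}


def record_calls(func):
    # Wrap func so that each call appends its arguments and return value.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        FUNC_PROV.setdefault(func.__name__, []).append(
            {'args': args, 'kwargs': kwargs, 'return': result})
        return result
    return wrapper


@record_calls
def scale(values, factor=2):
    return [v * factor for v in values]


scale([1, 2, 3])
scale([4], factor=10)
for record in FUNC_PROV['scale']:
    print(record)
```

Because the records are keyed by function name rather than by cell, calls made from different cells end up in the same list, which is why this information is kept apart from the cell provenance.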

Here are some of the registered functions that can be used in a notebook:<\/p>\n

view_prov(type):<\/i> view the captured provenance information; type can be \u201call\u201d, \u201ccell\u201d, or \u201cfunc\u201d.
\nview_cell(name)<\/i> and view_func(name)<\/i>: view the provenance of a specific cell or function.
\nsave_file(file)<\/i>: save the output of view_prov(\u201call\u201d) to a file.
\nconstruct():<\/i> construct the provenance information.<\/p>\n

And of course, the “%load_ext cellProv.extension” command needs to be run at the beginning of a notebook to enable the CellProv extension.<\/p>\n
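For readers who have not written an IPython extension before, here is a minimal sketch of how a module can expose functions to a notebook when %load_ext is run. The load_ipython_extension hook and the shell's push method are standard IPython; the PROV store and the function bodies are simplified stand-ins and are not taken from the CellProv source.

```python
# Hypothetical contents of an extension module such as cellProv/extension.py.

# Stand-in in-memory provenance store (an assumption for this sketch).
PROV = {'cell': {}, 'func': {}}


def view_prov(type='all'):
    # Return the whole store, or only its 'cell' or 'func' part.
    if type == 'all':
        return PROV
    return PROV[type]


def view_cell(name):
    # Provenance of one cell, or None if nothing was recorded for it.
    return PROV['cell'].get(name)


def view_func(name):
    # Provenance of one function, or None if nothing was recorded for it.
    return PROV['func'].get(name)


def load_ipython_extension(ipython):
    # IPython calls this hook when the notebook runs %load_ext on the module.
    # push() copies the given names into the notebook's user namespace,
    # so they can be called from any cell like ordinary functions.
    ipython.push({'view_prov': view_prov,
                  'view_cell': view_cell,
                  'view_func': view_func})
```

After the extension is loaded, a cell could then simply call view_prov('cell') without importing anything.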

The source code can be found in the BitBucket repository<\/a>, and a presentation summarizing this summer project, including the tools, can be found here<\/a>.<\/p>\n

There is still some future work and room for improvement:<\/p>\n

1. The provenance information is currently stored in memory, so the information from different trials cannot be aggregated. A better solution would be to store it in a database, but we need to consider the trade-off between space and performance as well.<\/p>\n

2. Depending on the user’s requirements, different provenance information may be needed. For now, only a small part of it is captured, so it would be a good idea to extend the tool in the future.<\/p>\n

3. Maintenance and improvement should continue.<\/p>\n
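To make the first item a bit more concrete, here is one possible shape for database-backed storage, using SQLite from the Python standard library. This is only a sketch of an option we have not implemented; the table layout, the trial numbering and the helper names are all assumptions.

```python
import json
import sqlite3


def open_store(path=':memory:'):
    # One row per provenance record, tagged with the trial it came from.
    conn = sqlite3.connect(path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS prov ('
        'trial INTEGER, kind TEXT, name TEXT, record TEXT)')
    return conn


def save_record(conn, trial, kind, name, record):
    # Store the record as JSON text so that its shape stays flexible.
    conn.execute('INSERT INTO prov VALUES (?, ?, ?, ?)',
                 (trial, kind, name, json.dumps(record)))
    conn.commit()


def records_for(conn, name):
    # Aggregate every record for one cell or function across all trials.
    rows = conn.execute(
        'SELECT trial, record FROM prov WHERE name = ? ORDER BY trial',
        (name,))
    return [(trial, json.loads(text)) for trial, text in rows]


conn = open_store()
save_record(conn, 1, 'func', 'scale', {'args': [[1, 2]], 'return': [2, 4]})
save_record(conn, 2, 'func', 'scale', {'args': [[5]], 'return': [10]})
print(records_for(conn, 'scale'))
```

Passing a file path instead of ':memory:' would keep the records between notebook sessions, at the cost of extra disk space and a write on every recorded call.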

We’ll keep working on the tool and testing it on different use cases.<\/p>\n","protected":false},"excerpt":{"rendered":"

As mentioned last week, we’d like to develop our own tool to capture provenance information of digital notebooks, so writing an iPython extension is good choice. An iPython extension is like an user defined\u00a0library, and defined APIs can be registered\u00a0that they can be invoked\u00a0in the notebook as functions. So following Continue reading Week 9 – The road ahead<\/span>→<\/span><\/a><\/p>\n","protected":false},"author":31,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[275],"tags":[],"_links":{"self":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/2509"}],"collection":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/users\/31"}],"replies":[{"embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/comments?post=2509"}],"version-history":[{"count":8,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/2509\/revisions"}],"predecessor-version":[{"id":2531,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/2509\/revisions\/2531"}],"wp:attachment":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/media?parent=2509"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/categories?post=2509"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/tags?post=2509"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}