This is my week 7 update. I worked on three main things: 1) a final check of transparency and reproducibility, plus a script that bundles newly generated products so they can be published at once; 2) testing the new bundle in Whole Tale (WT); and 3) brainstorming how to track the provenance of an original package through the packages derived from it.
For transparency, a few minor issues needed to be checked, namely whether the node type, edge type, edge label, and direction of the generated workflow can be changed. The workflow is generated from (subject, predicate, object) triples, which are serialized as RDF/XML. By default the flow runs from subject to object; it can be reversed by swapping subject and object values. However, that reversal also requires changing the provenance representation, e.g., wasDerivedFrom would have to become an inverse such as "derived". Such a change is possible, but it would violate the provenance standards defined in ProvONE. Full automation is therefore not possible for now, but another workflow tool (e.g., YesWorkflow) can still be used with the provenance information we have retrieved/captured. During the transparency check I also collected some missing information, e.g., whether the package is original (and, if it is not, where the provenance of the original package lives), which is a natural next thing to do.
Since we now know the inputs, outputs, and scripts, and how to run the scripts, I assumed the role of another scientist who has generated new products after some analysis. The question is whether we can automatically create a package from those new products and publish it to a DataONE repository. I therefore wrote a script that creates a package, captures and adds provenance to it, and publishes it to DataONE by executing a single R script (in practice, three R scripts, one per step of the bundling pipeline, plus the necessary setup such as installing required packages). I also tested this bundling script, together with the transparency and reproducibility checks, in a tale on WT (the dev server could not create a tale, so I used the production server in private mode). One issue is that a tale is read-only, which raises two questions: how do we verify that the transparency check was done correctly, and how do we complete reproducibility? For example, if two output artifacts are generated from an intermediate dataset that is itself created from other datasets, we first need to regenerate that intermediate result, which is not possible in a read-only tale.
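The three-part pipeline above can be sketched as a small orchestrator. This is an illustration in Python (the real scripts are R, run here via the standard `Rscript` command); the file names are hypothetical placeholders, not the actual script names:

```python
import subprocess

# Hypothetical file names for the three parts of the bundling pipeline.
STEPS = [
    "01_create_package.R",   # assemble the new products into a package
    "02_add_provenance.R",   # capture and attach provenance
    "03_publish_package.R",  # publish the package to a DataONE repository
]

def run_pipeline(steps=STEPS):
    """Run each R script in order; check=True aborts on the first failure."""
    for script in steps:
        subprocess.run(["Rscript", script], check=True)
```

Keeping the steps as separate scripts makes each part testable on its own, while a single entry point still publishes everything at once.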
Finally, I brainstormed how to track the provenance of an original package through the packages derived from it. Since DataONE packages allow nesting (a package can contain another package), we might have a parent package that contains both the original package and its derivatives. Provenance of the derivation (e.g., a package "B" is derived from an original one "A") could then be added to the resource map of the parent package. An open question is whether nesting can be used for this purpose, since nesting generally represents a subcategory of one larger analysis. Alternatively, the link might be expressed simply by adding a triple stating that "B" was derived from "A". Building this bridge for the provenance connection is the major next step, alongside the minor issues mentioned above (e.g., those from the WT testing).
That is all for this week.
Hope you all have a nice weekend.