Week 6 – Improvements to Transparency and Reproducibility

Hi all,

In week 6, I mainly improved the scripts for the two cases I introduced last week. As a reminder, the cases are: 1) transparency (understanding what has happened and validating the information given) and 2) reproducing (intermediate) outputs.

For transparency, one improvement is that the scripts now generate a fully connected workflow without losing information. The previous version had two problems: some nodes were shown by their ids rather than their names, and the workflow was not fully connected. Both were resolved by understanding ProvONE (the provenance model used in DataONE) better. In the ProvONE model, a “prov:wasDerivedFrom” relationship first goes through “plan” and “execution”, which causes their ids to appear in the provenance relationships. However, by looking at all the relationships together as a graph, each id can be replaced with the actual name of the object, which resolves the naming issue in the workflow. As for the disconnection issue, the tool allows control over the node types in the workflow, so giving the nodes the same type resolves it.

Some minor things remain: the direction of the flow should be fixed, and the node type representing programming scripts should differ from the one representing data files. Moreover, I need to find a way to place the output of a provenance capture tool (e.g., recordr) into an RDF form of provenance that can be used for visualization, because this provenance information needs to be tracked through every package derived from the original package. For example, suppose a scientist “Bob” uses “Alice”'s package for another analysis and publishes the result as a new package. If the provenance of “Alice”'s package is not recorded in “Bob”'s package, information is lost.
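As a rough illustration of the id replacement described above, here is a minimal Python sketch. It is not the actual script: the triples, the identifiers, the file names, and the use of "rdfs:label" as the source of names are all assumptions made for the example, and the structure is much simpler than a real ProvONE record.

```python
# Hypothetical (subject, predicate, object) triples. All identifiers and
# file names are made up, and the structure is simplified compared with
# a real ProvONE record.
TRIPLES = [
    ("urn:uuid:e1", "prov:used", "urn:uuid:d1"),
    ("urn:uuid:d2", "prov:wasGeneratedBy", "urn:uuid:e1"),
    ("urn:uuid:d2", "prov:wasDerivedFrom", "urn:uuid:d1"),
    # Assumption: names are available as label triples in the same graph.
    ("urn:uuid:e1", "rdfs:label", "run of analysis.R"),
    ("urn:uuid:d1", "rdfs:label", "input.csv"),
    ("urn:uuid:d2", "rdfs:label", "result.csv"),
]


def replace_ids_with_names(triples, name_predicate="rdfs:label"):
    """Return the provenance edges with every known id replaced by its name."""
    # First pass over the whole graph: collect id -> name.
    names = {s: o for s, p, o in triples if p == name_predicate}
    # Second pass: rewrite the edges; ids without a known name are kept as-is.
    return [
        (names.get(s, s), p, names.get(o, o))
        for s, p, o in triples
        if p != name_predicate
    ]


for edge in replace_ids_with_names(TRIPLES):
    print(edge)
```

The point of the two passes is that a name may only be found somewhere else in the graph, so the lookup table has to be built from all relationships before any edge is rewritten.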

For reproducing the results, one issue was that the programming scripts in the data package had to be modified through manual steps. I have therefore found a way to record the modifications in a file. This file can be used as a patch, which automates the process. A patch can be generated after updating the original script (there is a command-line tool that compares the two files and outputs the differences). To complete the patching steps, we can use a script containing the patching commands, which anyone who wants to use this data package later can run. For example, when a tale is created, the patch file and the script that handles the patching steps can be part of the tale as a utility. With this utility, a scientist who wants to regenerate the original results can do so automatically.
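The patching idea can be sketched in Python as follows. This is only an illustration under assumptions: the script name "analysis.R" and its contents are hypothetical, Python's difflib stands in for the command-line comparison tool, and apply_patch assumes the standard "patch" program is installed.

```python
import difflib
import subprocess


def make_patch(original: str, modified: str, filename: str) -> str:
    """Return a unified diff (the format `diff -u` produces) between two versions."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        modified.splitlines(keepends=True),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
    )
    return "".join(diff)


def apply_patch(script_path: str, patch_path: str) -> None:
    """Apply a saved patch file to a script in place.

    Assumes the standard `patch` command-line tool is available on PATH.
    """
    subprocess.run(["patch", script_path, patch_path], check=True)


# Hypothetical script contents: the manual edit replaces a hard-coded path.
original = 'setwd("/home/alice/project")\ndata <- read.csv("input.csv")\n'
modified = 'setwd(".")\ndata <- read.csv("input.csv")\n'

patch_text = make_patch(original, modified, "analysis.R")
print(patch_text)
```

Saving patch_text to a file next to the original script and calling apply_patch on the two paths would replay the manual edit, which is roughly what a utility shipped with the tale would do.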

The next steps mainly focus on 1) how the provenance of the original package can be recorded in a new package derived from it (called provenance of provenance) and 2) whether, if recording provenance of provenance is possible, the same technique can be used to verify if a given provenance is the original or a provenance of provenance (e.g., by looking at this information, a scientist might be able to judge whether the dataset and the provenance in it are original). A minor task is to find a tool that converts RDF into YesWorkflow (which presents a clear view of provenance).

I think this is all for this week.
I hope you all have a nice weekend.
