Week 8 – Stitching two packages using provenance

Hi all,

In week 8, I focused on two things: storing additional provenance information that records which object in another package a result in a package is derived from (we call this stitching for now), and updating the bundling script introduced last week, along with testing it in a tale in Whole Tale.

When we bundle the new analysis product (called Bob’s package), we first create a metadata file and then add the objects to the package along with it. The metadata contains a block called “dataTables” or “otherEntities”, which the DataONE landing page uses to present the metadata for each object. If this block is missing, the provenance information is not shown, even if the package already contains provenance. In the previous version this block was missing for every object, so I now create the blocks and write them into the metadata file before publishing. Another minor fix concerns the file format: when we add an object to the data package, its format can be classified (it is shown in the table that lists all objects on the landing page), but the previous version ignored this for the sake of full automation. Since I found an existing function, “guess_format_id”, the corresponding part of the script has been updated to use it.
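To make the format step concrete, here is a minimal base R sketch of the kind of lookup such a function performs. It is a stand-in and not the actual “guess_format_id” implementation: the helper name guess_format_sketch and the small extension table are my own assumptions for illustration.

```r
# Hypothetical stand-in for a format-guessing helper; NOT the real guess_format_id.
# The extension table below is an illustrative assumption, not a complete list.
guess_format_sketch <- function(filenames) {
  known <- c(
    csv = "text/csv",
    txt = "text/plain",
    pdf = "application/pdf",
    png = "image/png"
  )
  # Look up each file by its lower-cased extension
  ext <- tolower(tools::file_ext(filenames))
  ids <- unname(known[ext])
  # Anything we do not recognize falls back to a generic binary type
  ids[is.na(ids)] <- "application/octet-stream"
  ids
}

formats <- guess_format_sketch(c("results.csv", "figure.png", "analysis.R"))
print(formats)
```

In the bundling script, the returned value would be passed as the format of each object when it is added to the package, so that the landing page table can show it.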

What is still missing in the new package is whether it is original or was generated using another package. To keep track of this, we simply add two triples: 1) which object in another package (called Alice’s package) was used for Bob’s analysis and 2) what that object in Alice’s package is. To create these relations, the ids of Alice’s package and of the object are encoded in the triples, and the relationships are embedded into Bob’s package using the “describeWorkflow” function (i.e., the function used to add provenance). This information is not visible on the DataONE landing page, in order to keep the displayed information clean. However, when another scientist downloads the package, it can easily be extracted (e.g., using the transparency script). Since the relationships contain the actual links to the objects, the scientist can download the original dataset directly through the link.
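As a rough illustration of what the two stitching triples could look like, here is a base R sketch that only builds them as a table. It does not call “describeWorkflow”; all identifiers are hypothetical placeholders, and the choice of predicates (PROV wasDerivedFrom and Dublin Core isPartOf) is an assumption of mine, not necessarily what the script uses.

```r
# All identifiers are hypothetical placeholders, not real DataONE ids.
alice_package_id <- "urn:uuid:alice-package"
alice_object_id  <- "urn:uuid:alice-object"
bob_result_id    <- "urn:uuid:bob-result"

# Assumed predicates for illustration only
prov_was_derived_from <- "http://www.w3.org/ns/prov#wasDerivedFrom"
dcterms_is_part_of    <- "http://purl.org/dc/terms/isPartOf"

# Build the two stitching triples as a subject/predicate/object table
stitch_triples <- function(result_id, source_object_id, source_package_id) {
  data.frame(
    subject   = c(result_id, source_object_id),
    predicate = c(prov_was_derived_from, dcterms_is_part_of),
    object    = c(source_object_id, source_package_id),
    stringsAsFactors = FALSE
  )
}

triples <- stitch_triples(bob_result_id, alice_object_id, alice_package_id)
print(triples)
```

A scientist who downloads Bob’s package could filter such a table by predicate to find the source object and the package it belongs to.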

Lastly, I tested the scripts that I have created (transparency, reproducibility, and bundling) in a tale (Whole Tale). The stitching test will be included in the next phase of testing in the tale. One issue I faced is that the main directory of the tale is read-only. To work around this, the files generated during the process have to be kept outside the main directory, with the paths assigned accordingly. This restriction also means that a patch file cannot be applied directly, because applying it requires modifying the original script. Thus, for now, we use patch files only to record which parts of the scripts have been updated. Furthermore, the read-only permission affects the “unzip” function in R: passing a path that points to the location of the zip file returns an error. Thus, the paths need to be set precisely before and after any unzip call in the scripts.
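One way to route the paths around an unzip call can be sketched as follows. This is an assumption about the workaround and not the exact code used in the tale: the helper name unzip_in_writable_dir is hypothetical, and a folder under tempdir() stands in for whatever writable location the tale provides.

```r
# Hypothetical helper: copy the zip to a writable directory, switch into it,
# unzip by bare file name, then restore the previous working directory.
unzip_in_writable_dir <- function(zip_path, work_dir) {
  dir.create(work_dir, showWarnings = FALSE, recursive = TRUE)
  local_zip <- file.path(work_dir, basename(zip_path))
  file.copy(zip_path, local_zip, overwrite = TRUE)

  old_wd <- setwd(work_dir)            # route the path before unzip
  on.exit(setwd(old_wd), add = TRUE)   # and restore it afterwards

  extracted <- unzip(basename(local_zip))
  file.path(work_dir, extracted)
}

# Assumed writable location; the real tale may offer a different one.
work_dir <- file.path(tempdir(), "tale_work")
# files <- unzip_in_writable_dir("path/to/package.zip", work_dir)
```

The on.exit call restores the working directory even if unzip fails, so the rest of the script keeps using the paths it expects.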

The next steps are to cover more stitching cases (e.g., we currently assume that Bob uses only Alice’s package, but Bob might use multiple packages for his analysis) and to decide how the relationship should be represented (a link to the original object, just the name of the object, or both if possible).

I think this is all for this week. I hope you all have a nice weekend.
