During this year’s DataONE summer project, our primary goal was to develop a provenance repository and an interface for interactively browsing provenance graphs. The summer project team met with members of the Provenance Working Group (ProvWG) in June at the University of California, Davis, and identified the following primary components of the proposed provenance repository: (i) workflow schema definition and its evolution, (ii) workflow execution details (i.e., the processing history or dependency graph), and (iii) stitching of multiple dependency graphs.
To fit these into the scope of the summer project, we decided to start with a minimal provenance model and extend it later as needed (as part of the ProvWG). We also decided to develop a web-based application that would allow a scientist to (i) upload trace files, (ii) interactively build a query, and (iii) view the result in tabular format or as an interactive dependency graph.
We chose the following technologies:
(2) Neo4j: a graph-based NoSQL database. One reason to try Neo4j was to gain experience with a “non-standard” database technology that is advertised as being suitable for graph data. It can be embedded in a J2EE application or run as an installed database server; we used the server version for our application. It provides a multi-threaded version (to achieve high availability) in addition to the standard version. Neo4j views the world as a graph made of three things: nodes, relations, and properties. Relations are the edges between nodes, and properties are tags stored as key-value pairs. We modeled runs, invocations, and data artifacts as nodes and dependencies as relations. We used properties in many cases. For example, a node-type property tags whether a node is a run, an invocation, or a data artifact.
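The modeling choice described above can be illustrated with a small, self-contained sketch. This is a plain in-memory property graph written for illustration only; it does not use the Neo4j API, and the class name `ProvGraph`, the property keys `type` and `label`, and the relation name `used` are all assumptions rather than names from our repository.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory property graph; this is NOT the Neo4j API.
final class ProvGraph {

    // A node carries key-value properties, e.g. "type" -> "invocation".
    static final class Node {
        final int id;
        final Map<String, String> properties = new HashMap<>();
        Node(int id) { this.id = id; }
    }

    // A directed, named relation (edge) between two nodes.
    static final class Relation {
        final Node from;
        final Node to;
        final String name;
        Relation(Node from, Node to, String name) {
            this.from = from;
            this.to = to;
            this.name = name;
        }
    }

    private final List<Node> nodes = new ArrayList<>();
    private final List<Relation> relations = new ArrayList<>();

    // Create a node and tag it with a node type: "run", "invocation", or "data".
    // The property keys "type" and "label" are hypothetical.
    Node addNode(String type, String label) {
        Node n = new Node(nodes.size());
        n.properties.put("type", type);
        n.properties.put("label", label);
        nodes.add(n);
        return n;
    }

    // Record a dependency between two nodes as a relation.
    Relation relate(Node from, Node to, String name) {
        Relation r = new Relation(from, to, name);
        relations.add(r);
        return r;
    }

    // Select nodes by their node-type property.
    List<Node> nodesOfType(String type) {
        List<Node> result = new ArrayList<>();
        for (Node n : nodes) {
            if (type.equals(n.properties.get("type"))) {
                result.add(n);
            }
        }
        return result;
    }

    int relationCount() {
        return relations.size();
    }
}
```

With this sketch, an invocation that reads a data artifact would be recorded as two `addNode` calls followed by `relate(invocation, data, "used")`, and `nodesOfType("invocation")` would return only the invocation nodes.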
We have also implemented the provenance repository using MySQL.
(3) Graphviz: open-source graph visualization software. Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks. In the graphs we generate with this tool, we represent data as circles and invocations (instances of an actor/process) as boxes. After a query executes, Neo4j returns a Java object, which we convert into a DOT-compatible format. Graphviz uses this DOT to produce the final graph, which we show in the “Display” tab of our application. Currently, we use Graphviz only to generate a static image of the graph.
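The conversion step can be sketched roughly as follows. This is a minimal illustration rather than our actual converter: the `DotExporter`, `Vertex`, and `Edge` names are hypothetical, and the input is assumed to have already been extracted from the query result into simple lists. Only the shape convention (circles for data, boxes for invocations) comes from the description above.

```java
import java.util.List;

// Hypothetical helper that renders a small dependency graph as DOT text.
final class DotExporter {

    static final class Vertex {
        final String id;
        final boolean isData; // true: data artifact, false: invocation
        Vertex(String id, boolean isData) {
            this.id = id;
            this.isData = isData;
        }
    }

    static final class Edge {
        final String from;
        final String to;
        Edge(String from, String to) {
            this.from = from;
            this.to = to;
        }
    }

    // Escape backslashes and double quotes so the id is a valid quoted DOT string.
    private static String quote(String id) {
        return "\"" + id.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    static String toDot(List<Vertex> vertices, List<Edge> edges) {
        StringBuilder sb = new StringBuilder("digraph provenance {\n");
        for (Vertex v : vertices) {
            // Data artifacts are drawn as circles, invocations as boxes.
            String shape = v.isData ? "circle" : "box";
            sb.append("  ").append(quote(v.id))
              .append(" [shape=").append(shape).append("];\n");
        }
        for (Edge e : edges) {
            sb.append("  ").append(quote(e.from))
              .append(" -> ").append(quote(e.to)).append(";\n");
        }
        return sb.append("}\n").toString();
    }
}
```

The resulting text can be saved to a file and rendered as a static image with the Graphviz command-line tool, for example `dot -Tpng graph.dot -o graph.png`.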
(5) Tomcat (Apache Tomcat): an open-source implementation of the Java Servlet and JavaServer Pages technologies. We used Tomcat as both our web server and our application server.
We have successfully deployed a minimal provenance model into the Neo4j database server and developed the “GoldenTrail” application using GWT. We integrated the web application with the Neo4j database. Currently, our application (http://kelor.genomecenter.ucdavis.edu:8080/GoldenApp/GoldenApp.html) is hosted on a UC Davis server.
We have tested the following:
– Upload a Kepler/COMAD trace file
– Build a query using the functionality mentioned earlier
– Execute the query, and
– View the results in tabular format and as a static graph.
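As an illustration of the kind of query a dependency graph supports, the sketch below computes everything upstream of a given node (its lineage) with a breadth-first traversal. It is an assumption-laden example, not the query mechanism of our application: the `LineageQuery` class, its method names, and the use of plain string identifiers are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical lineage query over a dependency graph of string identifiers.
final class LineageQuery {

    // dependsOn.get(x) lists the nodes that x directly depends on.
    private final Map<String, List<String>> dependsOn = new HashMap<>();

    void addDependency(String node, String dependency) {
        dependsOn.computeIfAbsent(node, k -> new ArrayList<>()).add(dependency);
    }

    // Breadth-first walk over dependency edges. Returns every node upstream
    // of start, in discovery order, excluding start itself.
    Set<String> upstreamOf(String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            List<String> direct =
                dependsOn.getOrDefault(current, Collections.<String>emptyList());
            for (String dep : direct) {
                if (seen.add(dep)) { // visit each node only once
                    queue.add(dep);
                }
            }
        }
        seen.remove(start); // start can reappear only if the graph has a cycle
        return seen;
    }
}
```

Each element of the returned set could then be shown as one row of a tabular result, or passed to a graph renderer.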
I am continuing to work on this project as part of my PhD program, with the following goals:
– View the result as an interactive graph using JIT/FD.
– Upload multiple trace files and stitch them either automatically or manually.
– Write the technical report for this project; attached is the correct version.