Weekly updates are posted below.
During this year’s DataONE summer project, our primary goal was to develop a provenance repository and an interface to interactively browse provenance graphs. The summer project team met with members of the Provenance Working Group (ProvWG) in June at the University of California, Davis and identified the following as the primary components of the proposed provenance repository: (i) workflow schema definition and its evolution, (ii) workflow execution details (i.e. the processing history or dependency graph), and (iii) stitching multiple dependency graphs together.
To fit these into the summer project, we decided to start with a minimal provenance model and extend it later as needed (as part of the ProvWG). We also decided to develop a web-based application that would allow a scientist to (i) upload trace files, (ii) build queries interactively, and (iii) view the results in a tabular format or as an interactive dependency graph.
We chose the following technologies:
(2) Neo4j: a graph-based NoSQL database. One reason to try Neo4j was to gain experience with a “non-standard” database technology that is advertised as being well suited to graph data. It can be used either embedded in a J2EE application or as a standalone database server; we used the server version for our application. It also provides a high-availability version in addition to the standard one. Neo4j views the world as a graph made of three things: nodes, relationships, and properties. Relationships are the edges between nodes, and properties are key-value pairs attached to nodes and relationships. We modeled runs, invocations, and data artifacts as nodes and dependencies as relationships, and used properties in many places; for example, a node-type property tags whether a node is a run, an invocation, or a data artifact.
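A plain-Java sketch of this modeling (these classes are illustrative stand-ins for the property-graph primitives, not Neo4j’s API; the names `ProvNode`, `ProvRelation`, and the property keys are ours):

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java stand-in for a property graph: nodes and relationships both
// carry key/value properties. Illustrative only -- not the Neo4j API.
class ProvNode {
    final Map<String, String> props = new HashMap<>();
    // The "type" property tags a node as a "run", "invocation", or "data" artifact.
    ProvNode(String type) { props.put("type", type); }
}

class ProvRelation {
    final ProvNode from, to;
    final String type; // dependency edges, e.g. "used" or "genBy"
    ProvRelation(ProvNode from, String type, ProvNode to) {
        this.from = from; this.type = type; this.to = to;
    }
}
```

For example, a data artifact generated by an invocation would be two `ProvNode`s joined by a `"genBy"` relation.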
We have also implemented the provenance repository using MySQL.
(3) Graphviz: open-source graph visualization software. Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks. In the graphs we produce with this tool, data items are drawn as circles and invocations (instances of an actor/process) as boxes. After executing a query, Neo4j returns a Java object, which we convert into the DOT format; Graphviz then reads the DOT file and produces the final graph, which we show in the “Display” tab of our application. Our current use of Graphviz generates a static image of the graph.
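The conversion step can be sketched as follows (the class and method names are ours, for illustration; only the shape convention — circles for data, boxes for invocations — comes from the description above):

```java
import java.util.List;

// Illustrative converter from query results to DOT text for Graphviz.
class DotWriter {
    static String toDot(List<String> dataIds, List<String> invocationIds,
                        List<String[]> edges) {
        StringBuilder sb = new StringBuilder("digraph trace {\n");
        // Data artifacts are drawn as circles...
        for (String d : dataIds) sb.append("  \"").append(d).append("\" [shape=circle];\n");
        // ...and invocations as boxes.
        for (String i : invocationIds) sb.append("  \"").append(i).append("\" [shape=box];\n");
        // Each edge is a {from, to} pair of node ids.
        for (String[] e : edges)
            sb.append("  \"").append(e[0]).append("\" -> \"").append(e[1]).append("\";\n");
        return sb.append("}\n").toString();
    }
}
```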
(5) Tomcat (Apache Tomcat): an open-source implementation of the Java Servlet and JavaServer Pages technologies. We used Tomcat as both our web server and application server.
We have successfully deployed a minimal provenance model into the Neo4j database server and developed the “GoldenTrail” application using GWT. We integrated the web application with the Neo4j database. Currently, our application (http://kelor.genomecenter.ucdavis.edu:8080/GoldenApp/GoldenApp.html) is hosted on a UC Davis Server.
We have tested the following:
– Upload Kepler/Comad trace file
– Develop query using the functionalities mentioned earlier
– Execute the query, and
– View results in a tabular format and in static graph format.
I am continuing to work on this project as part of my PhD program to complete the following:
– View the result as an interactive graph using JIT/FD.
– Upload multiple trace files and stitch them either automatically or manually.
Last week I worked on rewriting storage and retrieval to make it easier for Graphviz to draw the trace graphs we create. I have it working for generating the basic graphs, but I have been having problems with REST traversals in Neo4j and am currently working through them on the Neo4j forums. In the version I was originally using, I had to use a strange combination of spaces and underscores in key/value names in the traverser, which I had to determine by trial and error. I must have downloaded it at just the wrong time, because after downloading Neo4j again that problem is resolved; however, I have so far been unable to filter on node relationships in the traverser’s return filter. We chose Neo4j because it makes traversals easy and is optimized for them, but since there is at present very little documentation on traversals through the REST API, making use of those easy traversals has been slow going. I am hoping someone on the Neo4j forums can help; otherwise I will have to write some of the traversals without the traversal system, which means we gain less by using Neo4j (most of my traversals are working correctly, so we should still gain some).
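For context, these traversals are described to the 1.x-era REST API as a JSON document posted to the server. A rough sketch of building such a payload (the endpoint and keys follow the Neo4j 1.x REST documentation as I understand it; the relationship type `genBy` is just an example from our model, and the builder class is ours):

```java
// Builds a JSON body for Neo4j's legacy REST traversal endpoint,
// POST <base>/db/data/node/{id}/traverse/node. Keys per the 1.x REST docs
// (assumption); the relationship type is illustrative.
class TraversalPayload {
    static String follow(String relType, String direction, int maxDepth) {
        return "{"
            + "\"order\":\"breadth_first\","
            + "\"relationships\":[{\"type\":\"" + relType + "\",\"direction\":\"" + direction + "\"}],"
            + "\"return_filter\":{\"language\":\"builtin\",\"name\":\"all\"},"
            + "\"max_depth\":" + maxDepth
            + "}";
    }
}
```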
Sorry for the late update this week: I was on a family vacation (a cruise) all last week with almost no internet, and have been having computer problems since I got back that I only solved last night. I will give another update at the end of this week once I have more done. Last week I added documentation to the code in all my REST wrappers, to make them easier for other team members to use and easier to extend later; cleaned up the code so it is easier to work with; and added traversals to my REST wrappers for accessing Neo4j. I also wrote a query function that traverses the graph backwards and finds data dependencies in the trace graph using Neo4j’s traversal system. The Neo4j REST documentation is not very complete yet, so I had to work some of it out by trial and error, but I got it working. This week I will be rewriting how we do storage and retrieval in the database to ease the integration with Graphviz that Saumen has been working on.
Last week, I developed a couple of key components. First, I developed the provenance trace parser for COMAD (an extension of Kepler). While developing this component, I reused a lot of Java/SQL code from last year’s DataONE Summer of Code project. Last year we developed various parsers around our MySQL database, which was a bit of a challenge this time because they were tightly coupled to the database calls. In the end, I successfully decoupled the parser from those calls and modified it to work with our generic bean layer. I am also in the process of implementing an Abstract Factory pattern (a core J2EE design pattern) so that I can enforce certain rules when integrating trace files from other systems. Second, Michael Wang (from the University of California, Davis) and I completed the integration with Graphviz, an open-source graph visualization tool for representing structural information as diagrams of abstract graphs and networks; it has important applications in networking, bioinformatics, software engineering, database and web design, machine learning, and visual interfaces for other technical domains. In the graphs we produce with it, data items are drawn as circles and invocations (instances of an actor/process) as boxes. For this integration we installed Graphviz on the server, and our application interacts with it through the driver class we developed this week. We had some difficulty calling the Graphviz executables from our Tomcat web server: Graphviz needs a temporary directory for its internal workings, so we created the directory /war/dot/tmp for its use. Michael Wang wrote the program that creates a DOT file (the format required by Graphviz) from our beans. The DOT file is stored in the /war/dot directory; Graphviz reads it and produces the graph as an image file in the same directory, which is then presented on the web site.
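The driver’s interaction with the Graphviz executable can be sketched like this (the helper class and paths are illustrative; `render` assumes the `dot` binary is on the server’s PATH):

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Illustrative driver: assembles and runs the Graphviz command line that
// turns a DOT file into a PNG. Class and method names are ours.
class GraphvizDriver {
    static List<String> buildCommand(String dotFile, String pngFile) {
        // Equivalent to: dot -Tpng trace.dot -o trace.png
        return Arrays.asList("dot", "-Tpng", dotFile, "-o", pngFile);
    }

    static Process render(String dotFile, String pngFile) throws IOException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(dotFile, pngFile));
        // Run inside war/dot so Graphviz finds its tmp/ scratch directory.
        pb.directory(new File("war/dot"));
        return pb.start();
    }
}
```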
So far, we have completed all the key components for our basic provenance model (gen-by/used provenance). This week we will add the GUI features and integrate these components so that our mentors can start testing them. That means we will need the pPOD trace files soon.
I have developed the work breakdown for the rest of our development effort, available at https://spreadsheets.google.com/spreadsheet/ccc?key=0Av5CaPcogJCkdGdxQU45WFpwYVNRbUoweXluQVcwVlE&hl=en_US&pli=1#gid=5. I have also created a very high-level architecture document, available at https://docs.google.com/drawings/d/1mvRey8-PmXtkhxhXDvwO11iF_Cn-ZwFIKhv3DsA0I5Y/edit?hl=en_US.
I explained our project initiative to Michael Wang to bring him on board.
When we started this week, we had a long list of questions, e.g. (1) how to deploy a GWT (Google Web Toolkit) project to Tomcat, (2) how to interface with Neo4j (a graph database) from our GWT project, (3) how to integrate the GWT project with Neo4j, Tomcat, and Graphviz (a graph visualization tool), and (4) how to develop an equivalent graph model in Neo4j for the D-OPM minimal model. This has been a very productive week, and we found answers to all of these questions.
I have developed a prototype in which I was able to deploy a GWT project to a Tomcat 6 server; this GWT project interfaced with a Neo4j database. It was challenging and interesting to resolve the issues as I encountered them. It is important to understand that the GWT compiler does not work like its standard Java counterpart: it needs to see the source code of everything it references, which makes it hard to add external jar files on the client side. One has to make sure that an external jar includes all of its source files. Once you have such a jar, the steps for adding it to the client side are: (i) create an xx.gwt.xml module file in the Java project/jar you want to use (to instruct the GWT compiler to use it), and (ii) pull the library in via an inherits entry in your application’s .gwt.xml file.
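As an illustration of step (ii), an application module file that pulls in such a library could look like this (module, package, and class names here are placeholders, not our actual ones):

```xml
<!-- GoldenApp.gwt.xml (placeholder names): the application's module file.
     The library is pulled in via <inherits>, naming the library's own
     xx.gwt.xml module; <source> marks the translatable client code. -->
<module>
  <inherits name="com.google.gwt.user.User"/>
  <inherits name="edu.example.tracelib.TraceLib"/>
  <source path="client"/>
  <entry-point class="edu.example.goldenapp.client.GoldenApp"/>
</module>
```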
If it is acceptable to keep the jar file (or the referenced project) on the server side (meaning the client side is not tightly coupled to it), there is an easier way to add external jars. The steps are: (i) create a service on the client side by extending RemoteService, (ii) create the asynchronous counterpart of that service, (iii) create the service implementation on the server side, (iv) keep the jar files on the server side, and (v) refer to the jar from the server-side service implementation.
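A compact sketch of steps (i)–(iii); the two marker types at the top are self-contained stand-ins for GWT’s real RPC types (`com.google.gwt.user.client.rpc.RemoteService` and friends), and the service name is hypothetical:

```java
// Stand-ins for GWT's RPC types so this sketch compiles on its own;
// in the real app these come from com.google.gwt.user.client.rpc.
interface RemoteService {}
interface AsyncCallback<T> { void onSuccess(T result); void onFailure(Throwable caught); }

// (i) client-side service interface
interface TraceService extends RemoteService {
    String uploadTrace(String traceXml);
}

// (ii) asynchronous counterpart used by client code
interface TraceServiceAsync {
    void uploadTrace(String traceXml, AsyncCallback<String> callback);
}

// (iii) server-side implementation (in GWT this would extend
// RemoteServiceServlet); it can freely use server-side jars, e.g. a DB client.
class TraceServiceImpl implements TraceService {
    public String uploadTrace(String traceXml) {
        return "stored:" + traceXml.length(); // placeholder behavior
    }
}
```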
While I was busy with establishing this integration, Michael developed a REST API prototype so that we can interface with the Neo4j database. I used that prototype to complete the integration I mentioned earlier.
I worked with Michael to develop the graph model in Neo4j for our D-OPM, and we also developed the data object layer, which should keep our client tier independent of our choice of database. You can find the Neo4j graph model we developed at https://docs.google.com/document/d/1KpW1sTwzcIqODl-O_u1vvv_wY2qMtzQChoCm-nNNkn4/edit?hl=en_US
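As an illustration of what such a database-independent layer might look like (all names here are hypothetical; Neo4j and MySQL would each get their own implementation of the interface):

```java
import java.util.*;

// Hypothetical store interface: the client tier talks only to this,
// so swapping Neo4j for MySQL means swapping the implementation.
interface ProvenanceStore {
    String addNode(String type);                              // returns a node id
    void addDependency(String fromId, String rel, String toId);
    List<String> dependenciesOf(String nodeId);
}

// Trivial in-memory implementation, handy for unit tests.
class InMemoryStore implements ProvenanceStore {
    private int next = 0;
    private final Map<String, List<String>> deps = new HashMap<>();
    public String addNode(String type) { return type + "-" + (next++); }
    public void addDependency(String fromId, String rel, String toId) {
        deps.computeIfAbsent(fromId, k -> new ArrayList<>()).add(toId);
    }
    public List<String> dependenciesOf(String nodeId) {
        return deps.getOrDefault(nodeId, List.of());
    }
}
```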
Finally, these are the steps needed to deploy a GWT project to Tomcat 6: (i) run the GWT compiler, which creates JavaScript files optimized for different browsers, (ii) create a jar file for just the src directory, (iii) create a deployment descriptor, (iv) run the deployment descriptor to deploy the code, and (v) test the application at http://localhost:8080//.html (use the respective hostname).
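For reference, a minimal deployment descriptor of the kind step (iii) refers to might look like this (servlet class, URL pattern, and file names are placeholders, not our actual configuration):

```xml
<!-- war/WEB-INF/web.xml: minimal descriptor for a GWT RPC service on
     Tomcat 6 (Servlet 2.5); all names are placeholders. -->
<web-app xmlns="http://java.sun.com/xml/ns/javaee" version="2.5">
  <servlet>
    <servlet-name>traceService</servlet-name>
    <servlet-class>edu.example.server.TraceServiceImpl</servlet-class>
  </servlet>
  <servlet-mapping>
    <servlet-name>traceService</servlet-name>
    <url-pattern>/goldenapp/trace</url-pattern>
  </servlet-mapping>
  <welcome-file-list>
    <welcome-file>GoldenApp.html</welcome-file>
  </welcome-file-list>
</web-app>
```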
This week I created an eclipse project for our GWT app and uploaded it to the Google Code svn I set up so Saumen and I can start collaborating on code. We decided against Google App Engine and Saumen is working on our Web and Neo4j database server. I spent most of this week working on REST wrapper code for Neo4j, as well as query and upload code to make use of the wrappers.
The GWT app I created currently just shows the overall layout we decided on last week, but it gives us a starting point to add the rest of our code to. The Google Code svn server has had intermittent problems (inaccessible for a couple of minutes at a time) but seems to be working overall. We decided that the sandbox architecture used by Google App Engine might cause us problems, so we are going to run our own servers; Saumen is setting up our web server and Neo4j database server.
When using Neo4j in server mode you must access it over REST. To me, REST is not very pretty to look at, read, or write, and it has a big learning curve, so I spent this week learning REST and then writing wrapper classes so I don’t have to deal with it again. As I finished parts of the REST wrappers, I started working on the functions that upload a run trace (a single run of a workflow) to the server. I have completed the upload portion and it seems to be in working order. It is written so that there is no separate “linking” step required to connect traces that share data: linking data nodes in the graph is part of the upload process. I have also completed the query function that retrieves a run back out of the database, so we have both upload and a single query to test it with. The Neo4j administration web page has been very useful in testing, because I can see the contents of the database as a graph and visually verify that an upload worked correctly, rather than relying only on the query (which might itself be broken).
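A minimal sketch of why no separate linking step is needed, in plain Java with illustrative names (not our actual wrapper API): during upload, data artifacts are looked up by identifier, so a second trace that mentions an already-uploaded artifact reuses the same graph node.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative: data artifacts are deduplicated by id at upload time,
// so traces that share data are linked automatically.
class TraceUploader {
    private final Map<String, Integer> dataNodes = new HashMap<>(); // artifact id -> node id
    private int nextNode = 0;

    // Returns the graph node for a data artifact, creating it only once.
    int dataNode(String artifactId) {
        return dataNodes.computeIfAbsent(artifactId, k -> nextNode++);
    }
}
```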
Overall, this week has been very successful, and we have gotten a lot done towards a working app.
Last week, I completed the conceptual schema design for “gen-by/used” provenance, which is the core of our proposed D-OPM. We will now build on this to add the different lands (e.g. workflow land, data land, context land) incrementally.
I installed GWT in MyEclipse 9.0 and worked through the example use case provided by Google; it worked both in my local environment and on Google App Engine.
Using PowerPoint, I developed a set of user interface designs for uploading, linking, and querying provenance data. We discussed the designs and finalized them; we will now develop our GUI in GWT based on this design.
This week we finalized the basic conceptual model we will use for our first iteration, came up with our GUI design for our front end program, started working with the Google Web Toolkit (GWT), and also refined the use cases that drive our project.
The first thing we did this week was finish the conceptual model for run-level provenance. Saumen did most of this, but we met several times during the week to make sure we agreed on the model and understood it, and to refine the model until we were happy with it. Saumen also led the GUI design, while I started work on the GWT code side of the app.
I spent most of this week working (and fighting) with GWT. It wasn’t until late in the week that I actually got the Google Plugin for Eclipse installed and working. I have gone through several tutorials on GWT and written some test programs to understand how to create GWT applications with the features we will need. For most of this week I was using Eclipse without the plugin because I was having a lot of trouble with the Eclipse plugin installer. I ended up manually downloading and placing many of the Google Plugin files, because Eclipse would almost always fail trying to download the larger files, even though my connection is fast and reliable. Getting the plugin working is making working with GWT much easier and should speed up development. I have also started looking into Google App Engine as a place to host our app, which the Google Plugin for Eclipse also makes much easier.
The final task for this week has been to start working on the Upload and Query APIs. We have started compiling a high level list of what the APIs will need to accomplish, and will use this in developing the actual APIs next week.
For next week we hope to have the first iteration of our Query and Upload APIs completed, along with an initial GWT app that can make use of the APIs.
This week, I developed the first version of the DataONE Provenance Model (D-OPM). In this model, I integrated the trace land with the workflow land and context land, and we reviewed it in our weekly call. I developed a small prototype using Graphviz that presents a conceptual model based on a definition provided in a DOT (a Graphviz tool) file; I would like to explore this later and turn it into a complete tool, since it could be very good at converting an ASCII definition of a model into a clear visual presentation. I have used ERwin and MySQL Workbench before, but those tools are better suited to developing logical and physical models; this week I used Microsoft PowerPoint to develop the conceptual model. I also looked at Neo4j, a graph-based NoSQL database. It can be used either embedded in a J2EE application or as a standalone database server, and it provides a high-availability version in addition to the standard one. Neo4j views the world as a graph made of three things: nodes, relationships, and properties, where relationships are the edges between nodes and properties are key-value pairs attached to the nodes. I am now defining the page layouts for the web site we will develop using GWT (Google Web Toolkit), which will allow scientists to upload trace files, link trace files, query provenance data, and view provenance data. Next week, I will finalize the D-OPM for run-level provenance.
This week I have been working with Saumen on understanding the model we will be using to start writing our APIs. We have decided to define the model sufficiently to start writing our APIs with minimal rework, and to continue to elaborate our conceptual model as we work on our APIs.
I spent the first part of the week going through the code from last year’s project to determine what we can reuse and what needs to be redone or changed. I have continued working with Neo4j and written some test code so we can start fitting our conceptual model into it. I did this by creating some dummy workflow/trace graphs and writing functions that traverse a graph to find data that depends on a given entity in the trace graph, or to find what data a given entity depends on. I have also started working on how we will be able to look deeper into workflows where we have more detailed provenance, without having to do anything extra for black-box workflows that have only input and output information; this might be accomplished by having the workflow node in the graph hold relationships to the input and output actors inside the workflow. Having some very simple graphs to traverse for testing is helping us refine the list of questions the API will have to answer.
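A plain-Java sketch of such a backward traversal over a dummy trace graph (structure and names are illustrative; edges point from a product to what it was derived from, so following them finds everything a given entity depends on):

```java
import java.util.*;

// Illustrative dummy trace graph with a breadth-first backward traversal.
class TraceGraph {
    final Map<String, List<String>> derivedFrom = new HashMap<>();

    void addEdge(String product, String source) {
        derivedFrom.computeIfAbsent(product, k -> new ArrayList<>()).add(source);
    }

    // Everything `start` transitively depends on.
    Set<String> dependsOn(String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(List.of(start));
        while (!queue.isEmpty()) {
            for (String src : derivedFrom.getOrDefault(queue.poll(), List.of())) {
                if (seen.add(src)) queue.add(src); // visit each node once
            }
        }
        return seen;
    }
}
```

Swapping the edge direction gives the forward question (what data depends on a given entity).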
This week we hope to have at least some interfaces showing what our query API will look like, which first requires finishing the conceptual model for run-level provenance, which Saumen is working on. Our current plan is to use the Google Web Toolkit to create our applications so that they can easily be accessed from the web, so starting to work with GWT is my next task.
This week I attended the DataONE Provenance WG meeting at Davis on 6/7/2011 and 6/8/2011, and came away with a good understanding of the scope and use of the proposed provenance repository. A domain scientist should be able to load trace files (workflow run details) from one or more workflow runs into the repository, and then query and view the derivation history of any output data product across multiple runs, where the workflows may have been developed in various workflow systems (e.g. Kepler, Taverna, Pegasus). During the meeting, we discussed the conceptual model for D-OPM (DataONE Open Provenance Model) in detail and planned to extend OPM to capture workflow specifications, complex data structures, and context information (e.g. who ran the workflow, when it was run, etc.). I have started evaluating available open-source conceptual modeling tools to select one for our purpose. I have also evaluated Neo4j 1.3, a lightweight graph-traversal database that can support up to 12 billion nodes; so far, Neo4j seems to be a very good tool for our purpose. I am also reviewing last year’s (DataToL SoC 2010) reports for reuse.
This past week I attended the DataONE Provenance Working Group Meeting at UC Davis on 6/7 and 6/8 where we discussed what the DataONE Open Provenance Model (D-OPM) should look like when completed.
For me the main focus of this week has been to fully understand what this project is going to accomplish and what our overall goals are. We are hoping to build a provenance repository that can stitch provenance together at the workflow-run level, including across different workflow systems. Other than the WG meeting, my main task this week has been becoming familiar with the shared provenance model developed in last year’s internship, so that we can reuse what we can rather than repeat that work. Later today I will meet with the other intern from last year’s project to talk about the model. I also set up our Google Code site and have begun looking at a graph-based database system (Neo4j) for storing the provenance and dependency information for traversal. Neo4j looks like a very good option because it makes graph traversal very simple, and stitching together data dependencies will consist largely of traversals.
Have a good weekend,