Week 5: Cypher Queries – DataONE Notebooks

This week's main focus was developing Java code for running Cypher queries and revising some of the queries from last week.

So far, we have code that converts a VisTrails provenance XML file into Geoff (a Cypher-like notation for entity labels that uses JSON for associated data) and .dot (Graphviz) formats, as well as into a set of Cypher creation commands. We also have code that executes those creation commands in batch mode to build a local Neo4j database.

Next, we have to work on Java code that takes parameters from input and runs arbitrary queries on a local database (the Neo4j Java code template is a good starting point). On a side note, this week I had to go through some Eclipse and Java online tutorials to get up to speed on the project.
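As a rough idea of the "taking parameters from input" part, here is a minimal sketch that reads a query string and a list of key=value arguments from the command line and turns them into a parameter map. The class name, the key=value convention and the number coercion are all my own assumptions for illustration. The sketch deliberately stops before touching the database; the map would then be passed to whichever query-execution call the Neo4j Java API offers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: not part of Neo4j or of the project code.
final class QueryArgsSketch {

    /** Parses args[from..] of the form key=value into an ordered parameter map. */
    public static Map<String, Object> parseParams(String[] args, int from) {
        Map<String, Object> params = new LinkedHashMap<>();
        for (int i = from; i < args.length; i++) {
            int eq = args[i].indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value, got: " + args[i]);
            }
            String key = args[i].substring(0, eq);
            String raw = args[i].substring(eq + 1);
            params.put(key, coerce(raw));
        }
        return params;
    }

    /** Assumption: whole numbers become Long, everything else stays a String. */
    static Object coerce(String raw) {
        try {
            return Long.valueOf(raw);
        } catch (NumberFormatException e) {
            return raw;
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: QueryArgsSketch <cypher-query> [key=value ...]");
            return;
        }
        String query = args[0];
        Map<String, Object> params = parseParams(args, 1);
        // A real runner would execute the query against the local database here.
        System.out.println("query:  " + query);
        System.out.println("params: " + params);
    }
}
```

Passing values as parameters rather than pasting them into the query text keeps the query string fixed, so the same query can be reused with different inputs.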

To execute the rest of the provenance queries, we need more traces in the database (including traces of multiple runs of a workflow, for aggregation queries). One suggestion is to test the queries initially on synthesized provenance traces. In addition, adding new traces to the repository may require modifying the converter so that it can convert every trace into a format that Neo4j can load.
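To make the synthesized-trace idea concrete, here is a small sketch that generates Cypher creation commands for several runs of a toy linear workflow (data, process, data, and so on). Everything in it is an assumption made for illustration: the `Data` and `Process` labels, the `INPUT_TO` and `GENERATED` relationship types, and the `run` and `name` properties are made-up names rather than the schema our converter produces, and the label syntax assumes a Neo4j version that supports node labels.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical generator for toy provenance traces; all names are illustrative.
final class SyntheticTraceSketch {

    /** Emits creation commands for `runs` executions of a linear workflow with `steps` steps. */
    public static List<String> generate(int runs, int steps) {
        List<String> out = new ArrayList<>();
        for (int r = 1; r <= runs; r++) {
            String run = "run" + r;
            out.add(node("Data", run, "d0")); // initial input of this run
            for (int s = 1; s <= steps; s++) {
                out.add(node("Process", run, "p" + s));
                out.add(node("Data", run, "d" + s));
                out.add(edge(run, "Data", "d" + (s - 1), "Process", "p" + s, "INPUT_TO"));
                out.add(edge(run, "Process", "p" + s, "Data", "d" + s, "GENERATED"));
            }
        }
        return out;
    }

    static String node(String label, String run, String name) {
        return "CREATE (:" + label + " {run: '" + run + "', name: '" + name + "'})";
    }

    static String edge(String run, String fromLabel, String from,
                       String toLabel, String to, String type) {
        return "MATCH (a:" + fromLabel + " {run: '" + run + "', name: '" + from + "'}), "
             + "(b:" + toLabel + " {run: '" + run + "', name: '" + to + "'}) "
             + "CREATE (a)-[:" + type + "]->(b)";
    }

    public static void main(String[] args) {
        // Two runs of a three-step workflow, one command per line.
        for (String command : generate(2, 3)) {
            System.out.println(command);
        }
    }
}
```

The output could be fed to the existing batch loader, which would give the aggregation queries several runs of the same workflow to work with before real traces are available.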

One challenge is how to store multiple traces in the PBase repository. They can be stored either in separate Neo4j database instances or as one large disconnected graph. In the latter case, each sub-graph can be assigned an ID that is checked when executing queries, or the sub-graphs can be turned into one large connected graph by adding a root node that points to the root of each sub-graph. I am also going to check how Cypher can support queries on more than one graph (for example, multiple traces, or a trace and a workflow specification), and/or how to join the results of multiple queries. Another task on the project “to-do” list is benchmarking the queries.
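The two single-database options above can be sketched as the Cypher text each one would need. This is only an illustration under assumed names: the `Entity`, `TraceRoot` and `Repository` labels, the `HAS_TRACE` relationship and the `traceId` property are placeholders I made up, not the actual PBase schema, and the statements assume a Neo4j version with label support.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of two ways to keep several traces in one Neo4j database.
final class TraceStorageSketch {

    /** Option 1: tag every node with the ID of the trace it belongs to. */
    public static String createTaggedNode(String traceId, String name) {
        return "CREATE (:Entity {name: '" + name + "', traceId: '" + traceId + "'})";
    }

    /** Option 1, query side: every query has to check the trace ID. */
    public static String matchInTrace(String traceId) {
        return "MATCH (n:Entity) WHERE n.traceId = '" + traceId + "' RETURN n";
    }

    /** Option 2: one repository root node that points to the root of each trace. */
    public static List<String> linkTracesToRoot(List<String> traceIds) {
        List<String> out = new ArrayList<>();
        out.add("CREATE (:Repository {name: 'PBase'})");
        for (String id : traceIds) {
            out.add("MATCH (r:Repository {name: 'PBase'}), (t:TraceRoot {traceId: '" + id + "'}) "
                  + "CREATE (r)-[:HAS_TRACE]->(t)");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(createTaggedNode("run1", "d0"));
        System.out.println(matchInTrace("run1"));
        List<String> ids = new ArrayList<>();
        ids.add("run1");
        ids.add("run2");
        for (String command : linkTracesToRoot(ids)) {
            System.out.println(command);
        }
    }
}
```

With the ID approach, forgetting the `WHERE` check silently mixes traces together. With the root-node approach, a query can start from the repository node and follow one `HAS_TRACE` edge to stay inside a single trace. The string concatenation here is only for readability; real code should pass the values as query parameters.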

Next week, I will mainly focus on writing/testing some more provenance queries.

As usual, feedback and suggestions are welcome!
