The main focus of this week was to create the main PBase repository, set up the development environment, and become more familiar with the format of Cypher queries.
The first step was to translate workflow provenance traces into a Neo4j-compatible format. Candidate approaches include the Neo4j Import tool, the Neo4j Batch Importer, the GraphML Loader, which loads a GraphML file (an XML format for graphs), and the XML to Graph Converter, which converts an XML file into Geoff and Cypher creation commands.
Neo4j comes in two forms: a standalone server that can be installed on any machine and is accessible through a REST API, or a local database embedded in a JVM process.
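As a rough illustration of the standalone-server option, the sketch below builds an HTTP request that carries one Cypher query. It assumes a local server on port 7474 that exposes the legacy `/db/data/cypher` REST endpoint; the URL, the helper name `build_cypher_request`, and the sample query are assumptions for illustration, not part of PBase. The request is only constructed, not sent.

```python
import json
import urllib.request

# Assumption: a local standalone Neo4j server exposing the legacy REST
# Cypher endpoint. Adjust host, port, and path for your installation.
CYPHER_URL = "http://localhost:7474/db/data/cypher"


def build_cypher_request(query, params=None, url=CYPHER_URL):
    """Build (but do not send) a POST request carrying one Cypher query."""
    payload = json.dumps({"query": query, "params": params or {}})
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


# Hypothetical query: look up a single node by its internal id.
request = build_cypher_request("START n=node({id}) RETURN n", {"id": 61})

# To run it against a live server:
#   with urllib.request.urlopen(request) as response:
#       result = json.load(response)
```

Keeping request construction separate from sending makes it possible to inspect the payload without a running database.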
Geoff is a declarative notation for representing graph data as concise, human-readable text. It allows independent subgraphs to be defined within a graph. There are several container representations for Geoff. For the PBase project, we used a form of JSON as input for the Neo4j database server.
Neo4j’s XML Converter can convert graphs in XML to Cypher commands and Geoff, but workflow traces use a different format, so we need to write a dedicated converter for them. A RESTClient can then be used to load the resulting Geoff-formatted graph into Neo4j.
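A dedicated converter of this kind could look roughly like the sketch below, which renders nodes and edges as Geoff-style text lines. The input structures, the function name `trace_to_geoff`, the actor `a12`, the property values, and the edges are all hypothetical. The `(node) {properties}` and `(a)-[:TYPE]->(b)` line shapes follow early Geoff as far as I know, but the syntax varies between Geoff versions, so treat the output format as an assumption.

```python
import json


def trace_to_geoff(nodes, edges):
    """Render nodes and edges as Geoff-style text lines.

    nodes: dict mapping a node id to a dict of properties
    edges: list of (source id, relationship type, target id) tuples
    """
    lines = []
    # Emit one line per node, with its properties serialized as JSON.
    for node_id in sorted(nodes):
        props = json.dumps(nodes[node_id], sort_keys=True)
        lines.append("(%s) %s" % (node_id, props))
    # Emit one line per directed, typed relationship.
    for source, rel_type, target in edges:
        lines.append("(%s)-[:%s]->(%s)" % (source, rel_type, target))
    return "\n".join(lines)


# Hypothetical trace fragment; names and edges are illustrative only.
nodes = {
    "e35": {"type": "data"},
    "e38": {"type": "data"},
    "a12": {"type": "actor"},
}
edges = [
    ("e38", "wasGeneratedBy", "a12"),
    ("a12", "used", "e35"),
]

print(trace_to_geoff(nodes, edges))
```

A real converter would first parse the workflow trace into these node and edge structures; that parsing step depends on the trace format and is left out here.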
The figure below shows three representations of a fragment (entities e35, e36, e38 and their relationships) of a Vistrail workflow (from left to right: the Vistrail workflow provenance trace, Graphviz, and Geoff).
Another task this week was setting up the development environment, installing the required tools, and working through the online Cypher tutorial(s).
Cypher is a declarative (what, not how) graph database query language for Neo4j, based on pattern matching and a SQL-like syntax. Cypher lets a user ask the database to find data that matches a specific pattern (i.e., “find things like this”). Like most query languages, Cypher is composed of clauses. In the simplest case, a query is composed of a START clause followed by a MATCH and a RETURN clause. Other Cypher clauses, such as WHERE, CREATE, and DELETE, are very similar to those of SQL. Cypher can express labeled directed relationships, transitive relationship paths, variable-length paths ((A)-[*]->(B)), optional relationships (A-[?]->B), and more.
(Neo4j)-[:IS_A]->(Graph Database)
(Neo4j)-[:likes]->(Cypher)
For example, the following queries, in order, show the list of data nodes (nodes that “wasGeneratedBy” actors), the list of data that has been generated but never used, and the list of all data nodes that were used in generating node(61).
After last week’s ProvWG meeting, we have a list of general categories of queries that PBase should be able to address. Next week I am going to narrow this list down to a more specific set of queries and devote more time to translating them into Cypher.
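To make the intent of these three queries concrete, the sketch below computes the same three result sets in plain Python over a small in-memory edge list. It is not Cypher and not the PBase implementation: the node names, the `used` relationship type, and the function names are assumptions chosen for illustration.

```python
# Hypothetical provenance edges: (source, relationship type, target).
EDGES = [
    ("d1", "wasGeneratedBy", "a1"),
    ("a1", "used", "d0"),
    ("d2", "wasGeneratedBy", "a2"),
    ("a2", "used", "d1"),
    ("d3", "wasGeneratedBy", "a2"),
]


def generated_data(edges):
    """Data nodes that have a wasGeneratedBy edge to an actor."""
    return {src for src, rel, _ in edges if rel == "wasGeneratedBy"}


def generated_but_unused(edges):
    """Generated data nodes that no actor ever used."""
    used = {dst for _, rel, dst in edges if rel == "used"}
    return generated_data(edges) - used


def upstream_data(edges, node):
    """Data nodes that contributed, directly or transitively, to `node`."""
    generated_by = {}  # data node -> actors that generated it
    uses = {}          # actor -> data nodes it used
    for src, rel, dst in edges:
        if rel == "wasGeneratedBy":
            generated_by.setdefault(src, set()).add(dst)
        elif rel == "used":
            uses.setdefault(src, set()).add(dst)

    result, frontier = set(), [node]
    while frontier:
        current = frontier.pop()
        # Step data -> generating actor -> data used by that actor.
        for actor in generated_by.get(current, ()):
            for data in uses.get(actor, ()):
                if data not in result:
                    result.add(data)
                    frontier.append(data)
    return result
```

On this sample graph, `generated_data(EDGES)` yields `d1`, `d2`, and `d3`; `generated_but_unused(EDGES)` yields `d2` and `d3`; and `upstream_data(EDGES, "d2")` yields `d0` and `d1`. The third function is the transitive case that a variable-length path pattern would express in Cypher.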
All feedback and suggestions are welcome and will be appreciated.