Week Eight Goals
This week the goals for this research effort were to build SPARQL queries against the datasets from all three repositories (KNB, ORNL DAAC, and Dryad). To do this I set up a Drupal server that would not interfere with my existing server and added a SPARQL tool to it. In addition, I needed to generate the ORNL DAAC RDF and create SPARQL queries that retrieve DataONE data across the three data repositories.
I created a new Drupal server and added the ARC2 library to it. I loaded the RDF data from the three repositories and ran specific SPARQL queries against it. As I loaded the data, some of my initial queries did not work as expected. This turned out to be caused by the Dryad data changing; as a result, some of the original data was not being retrieved, and I am incorporating changes to fix this. I now have a SPARQL query tool with which I can run queries against any RDF data that I load. For now, this is only DataONE data.
Through the SPARQL query tool I am able to extract all data, which was an initial check for me that the loading and querying were working. I can also run more targeted queries, such as ‘give me all Articles’, and narrower ones still, e.g., about authors. I will be updating the query link later this week to show sample queries for other DataONE-related data. Initially all seemed quite straightforward, but I struggled to get some of the data to work; this was due to an issue with the RDF, which I am fixing. The good news is that once these queries work, they will include data from all three repositories.
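As a rough illustration, the initial ‘extract all data’ check and an ‘all Articles with their authors’ query might look like the following. The exact class and property URIs depend on the vocabularies used in the generated RDF (bibo and dcterms here), so treat these as sketches rather than the exact queries I ran:

```sparql
# Sanity check: pull back every triple in the store (small datasets only).
SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 25

# All Articles and their authors, assuming the bibo and dcterms vocabularies.
PREFIX bibo:    <http://purl.org/ontology/bibo/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?article ?title ?author WHERE {
  ?article a bibo:Article ;
           dcterms:title   ?title ;
           dcterms:creator ?author .
}
```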
The next step is to add RDF data from a larger RDF dataset (e.g., DBpedia or data.gov) and run queries that link DataONE data with it. Then I will be able to demonstrate Use Case 2, where I access DataONE data from a cloud of data. I will be starting that this week. The DBpedia data is accessible through a DBpedia SPARQL endpoint, and the ARC2 library allows for integration with remote data stores. The data.gov data will require that I contact someone from that group to provide me mechanisms for accessing the relevant RDF triples.
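The kind of linking query I have in mind might look like this sketch. The SERVICE keyword is SPARQL 1.1 federation against the public DBpedia endpoint (ARC2 can also reach remote stores through its own configuration); the owl:sameAs link from a local subject to a DBpedia resource is an assumption for illustration, not something the current data already contains:

```sparql
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX owl:     <http://www.w3.org/2002/07/owl#>
PREFIX dbo:     <http://dbpedia.org/ontology/>

# For each local dataset, follow a (hypothetical) owl:sameAs link to DBpedia
# and pull back the English abstract of the matching resource.
SELECT ?dataset ?dbpediaThing ?abstract WHERE {
  ?dataset dcterms:subject ?subject .
  ?subject owl:sameAs ?dbpediaThing .
  SERVICE <http://dbpedia.org/sparql> {
    ?dbpediaThing dbo:abstract ?abstract .
    FILTER (lang(?abstract) = "en")
  }
}
```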
- For the ORNL DAAC data, I decided to take a different approach. Instead of trying to work my way from metadata to data, I generated RDF directly from the datasets that I could access on the FTP site. For two of the files I could see how they relate and how the RDF data could be structured to be useful for finding this data. In some cases I used common vocabularies, e.g., dcterms, and in others I made up terms specific to a DAAC namespace because no existing vocabulary applied. Still, there were cases where the data was too unique to the dataset: even if the data were in RDF, additional knowledge would be needed to understand what that data meant. Mapping these types of datasets seems pointless unless you understand how they will be used, because without some useful interface this information is difficult to understand and thus to search for.
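A minimal Turtle sketch of what I mean, mixing a shared vocabulary (dcterms) with a made-up DAAC namespace; the daac: prefix, its terms, and the example.org URIs are invented for illustration:

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix daac:    <http://example.org/ornldaac/terms#> .  # hypothetical namespace

<http://example.org/ornldaac/dataset/123>
    dcterms:title  "Example ORNL DAAC dataset" ;  # shared, widely understood
    dcterms:format "text/csv" ;
    daac:siteCode  "US-XYZ" ;                     # dataset-specific: needs extra
    daac:plotId    "A-07" .                       # knowledge to interpret
```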
- The web is constantly evolving. No matter how careful you are at creating links, things change and links are lost. This might occur because an XML structure changes, because the value of the content changes, etc. As I have worked with the many tools for this project I have found that this is inevitable, especially around new ‘hot’ topics like the LOD cloud. Still, if we are considering an LOD infrastructure, there needs to be some support for adjusting to the evolution of the LOD cloud, in particular when items are modified or deleted.
- As I close off the process of choosing vocabularies (I am finally seeing ORNL DAAC RDF data), I am reminded of how challenging vocabulary searches can be. First there is the task of finding vocabularies that describe what is needed, then sifting through similar properties and equivalence declarations, and finally assuring that the vocabulary is the best choice and is available on the Web. One example was the bibo versus prism namespaces. The bibo namespace had a vocabulary similar to prism’s; bibo even defines equivalent properties that align with prism. Both namespaces have been mentioned in references that I have read for this research. When I searched, however, I could not resolve a single link to the prism RDF specification, even though I found multiple supposed links. I really had no choice but to choose the bibo vocabulary. This process is too time consuming and tedious. There should be an easier way to identify relevant vocabularies, and all vocabulary definitions should be Web accessible, e.g., resolve to RDF.
These are the resources created from this research effort.
- Dryad-only RDF can be found at: http://rio.cs.utep.edu/ciserver/ciprojects/sdata/DryadData.rdf
- KNB-only RDF can be found at: http://rio.cs.utep.edu/ciserver/ciprojects/sdata/KNBData.rdf
- ORNL DAAC RDF is on its way
- All DataONE RDF, in one URL, can be found at: http://rio.cs.utep.edu/ciserver/ciprojects/sdata/DataONEData. This file is just a mashup of the individual files for each repository.
- Project items under source control can be found at: https://github.com/hlapp/LOD4DataONE
- Documentation on the Java code can be found at: http://rio.cs.utep.edu/ciserver/sites/default/files/lod4dataone/doc/index.html
- An initial PowerPoint slide show of the work done up to week 3 can be found at: http://rio.cs.utep.edu/ciserver/ciprojects/udata/LOD4DataONEWeek3EX.ppsx
- PowerPoint slides covering the work for Use Case 1, where DataONE data can be browsed via an RDF browser, can be found at: http://rio.cs.utep.edu/ciserver/ciprojects/udata/LOD4DataONEWeek5.pptx
- Use case notes can be found at: https://notebooks.dataone.org/lod4dataone/use-cases/
- Project notes can be found at: https://notebooks.dataone.org/lod4dataone/notes