Week Nine Update – DataONE Notebooks

Week Nine Goals

This week the goal for the LOD4DataONE research effort was to reach DataONE data through queries over outside data sources such as DBpedia and Data.gov. By loading data from the DBpedia and Data.gov SPARQL endpoints and combining it with the existing DataONE RDF, which was created from KNB, ORNL DAAC, and Dryad data, queries can be written that show relationships between the datasets, e.g., through shared species or locations.

Results

I moved the code for the Drupal windows into a module. This makes it easier for me to add and display queries dynamically, and it also lets me add the code to GitHub. I was able to extract data from Data.gov and DBpedia through their SPARQL endpoints. You can see all of the queries at http://manaus.cs.utep.edu/lod4d1: select the LOD4DataONE Queries menu on the left-hand side, then select a query to see its results. I am glad to see data, but I still feel that the results lack context and usefulness for the people who actually view them.
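To make the endpoint step more concrete, here is a minimal sketch of how a SPARQL query can be packaged as an HTTP GET request, following the SPARQL protocol convention of passing the query text in a `query` parameter. This is not the code from the Drupal module: the function name `build_sparql_request` and the example query are assumptions made for illustration. The request is only built, not sent, so the sketch runs without network access.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Public DBpedia SPARQL endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query (an assumption, not one of the project's queries):
# find resources whose English label is a given species name.
EXAMPLE_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?resource WHERE {
  ?resource rdfs:label "Quercus alba"@en .
} LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Build (but do not send) a GET request carrying a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(DBPEDIA_ENDPOINT, EXAMPLE_QUERY)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it to the endpoint. That step is left out here so the example stays self-contained.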
I will now be starting the final week of research. My plans are to finish the demonstration that connects Data.gov and DBpedia data to DataONE data, add to the queries based on community input and my own ideas, update and publish a use case document outlining the work I have done, and finally document my conclusions. Links to all of this will be found on the LOD4DataONE notebook. The demonstration of this research effort will be available from the search tool, which will use the RDF created for the three repositories to answer questions using a small cloud of data.

Observations

As I finalized some RDF work, in particular my effort to extract data from the ORNL DAAC datasets, I considered more automation, where a generic tool could do the conversion for me. There are Web-based discussions about automation and RDF conversion tools; for example, http://www.w3.org/wiki/ConverterToRdf lists several tools, although few of them target the kinds of data that are published from scientific experimentation in the ORNL DAAC, Dryad, and KNB repositories. I found several issues with extracting scientific data to create RDF that would make automation tricky: some files are tightly coupled with other files, data fields are set inconsistently, data is duplicated across datasets, properties are used more than once with different meanings, a single file can hold data with different structures, and data values are inconsistent. These issues require the data extraction to be totally or partially manual. They also bring up a concern mentioned earlier in this research: data context, and the risk of fragmenting scientific data and thus losing its overall meaning. As a result, RDF may be useful for searching for data, but generic tools for mapping scientific data to RDF should be used with caution.
Looking back through the work I did, I found that I used links to connect the DataONE data outward, and where there were no links I depended on vocabulary, e.g., types or properties, to show relationships in a bigger context. The issue with depending on types and properties is that, for a query to find data, the RDF must already be in the datastore being queried. In the distributed Semantic Web this is not always the case, so searching for data that is not linked is limited.
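This limitation can be illustrated with a toy triple store. The sketch below uses made-up identifiers: the `ex:` and `remote:` names and the `match` helper are hypothetical and do not come from the project's RDF. It shows that a type-based query finds only resources whose triples have been loaded into the store being queried, while a link at least records where to look next.

```python
def match(store, subject=None, predicate=None, obj=None):
    """Return the triples in the store that fit the pattern; None is a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Hypothetical local data: one dataset described by type, one linked outward.
local_store = {
    ("ex:dataset1", "rdf:type", "ex:SpeciesObservation"),
    ("ex:dataset1", "ex:species", "Quercus alba"),
    ("ex:dataset2", "owl:sameAs", "remote:recordA"),
}

# Hypothetical remote data that was never loaded into the local store.
remote_store = {
    ("remote:recordB", "rdf:type", "ex:SpeciesObservation"),
}

# A type-based query sees only what is in the local store.
local_hits = match(local_store, predicate="rdf:type", obj="ex:SpeciesObservation")
print(sorted(s for s, _, _ in local_hits))

# The same query finds the remote record only once its triples are loaded too.
merged_hits = match(
    local_store | remote_store, predicate="rdf:type", obj="ex:SpeciesObservation"
)
print(sorted(s for s, _, _ in merged_hits))
```

In this toy setup, `ex:dataset2` has no shared type but does carry a link, so a client could follow `remote:recordA` to fetch more data. `remote:recordB` has neither a link nor a local copy, so the local query cannot see it.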
There is a lot of work already in motion on linked data, with many people in similar situations learning how to convert their data to RDF and make it available on a bigger cloud. I am certain this was one reason why this internship was identified. Many groups, e.g., the LOD community and RPI, are working to document this, and many have focused on providing pages and videos to help. Despite the availability of this information, support for understanding it is dispersed, chaotic, and unstable: things work and then break, connections are made and then lost, and many of the tools and much of the documentation assume certain levels of knowledge or experience. It seems that the reality of publishing linked data for smaller organizations, individual scientists, and even citizen science will be limited by these types of issues.

Research Links

The SPARQL Query page, with example queries, can be found at: http://manaus.cs.utep.edu/lod4d1
Dryad-only RDF can be found at: http://rio.cs.utep.edu/ciserver/ciprojects/sdata/DryadData.rdf
KNB-only RDF can be found at: http://rio.cs.utep.edu/ciserver/ciprojects/sdata/KNBData.rdf
ORNL DAAC RDF can be found at: http://rio.cs.utep.edu/ciserver/ciprojects/sdata/DAACData.rdf
All DataONE RDF, in one URL, can be found at: http://rio.cs.utep.edu/ciserver/ciprojects/sdata/DataONEData. This file is just a mashup of the individual files for each repository.
Project items under source control can be found at: https://github.com/hlapp/LOD4DataONE
Documentation on the Java code can be found at: http://rio.cs.utep.edu/ciserver/sites/default/files/lod4dataone/doc/index.html
An initial PowerPoint slide show of the work done up to week 3 can be found at: http://rio.cs.utep.edu/ciserver/ciprojects/udata/LOD4DataONEWeek3EX.ppsx
PowerPoint slides covering the work for Use Case 1, where DataONE data can be browsed via an RDF browser, can be found at: http://rio.cs.utep.edu/ciserver/ciprojects/udata/LOD4DataONEWeek5.pptx
Use case notes can be found at: https://notebooks.dataone.org/lod4dataone/use-cases/
Project notes can be found at: https://notebooks.dataone.org/lod4dataone/notes
