Week Seven Update

Week Seven Goals

The goal for this week was to focus on data from a larger source that would link to DataONE data, in support of Use Case 2. To this end, I considered how existing data in the Linked Open Data (LOD) cloud, e.g., location or species data, relates to DataONE and how to find DataONE data from a cloud of RDF knowledge. Doing this requires that I create new RDF with the proper types and properties so that the data can be retrieved by a software agent, i.e., via a SPARQL query.
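
Here is a minimal sketch of what that new RDF could look like, using rdflib in Python to type a dataset and attach species and location properties. The d1: namespace, the dataset URI, the aboutSpecies property, and the example values are hypothetical placeholders; only Dublin Core and the W3C Basic Geo vocabulary are real.

    # A hypothetical DataONE dataset described in RDF with rdflib.
    # The d1: namespace, dataset URI, and aboutSpecies property are placeholders;
    # Dublin Core and the W3C Basic Geo vocabulary are real.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCTERMS, RDF

    D1 = Namespace("http://example.org/dataone/")                # hypothetical
    GEO = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")  # W3C Basic Geo

    g = Graph()
    dataset = URIRef("http://example.org/dataone/dataset/123")   # hypothetical

    g.add((dataset, RDF.type, D1.Dataset))
    g.add((dataset, DCTERMS.title, Literal("Wolf spider field survey")))
    g.add((dataset, D1.aboutSpecies,
           URIRef("http://example.org/species/Lycosidae")))  # placeholder for a Geospecies URI
    g.add((dataset, GEO.lat, Literal("43.07")))
    g.add((dataset, GEO.long, Literal("-89.40")))

    print(g.serialize(format="turtle"))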

Results

I started this week focusing on location and species as important qualities of DataONE data. I like the Geospecies website and ontology, defined at http://geospecies.org by Peter DeVries at the University of Wisconsin. The Geospecies ontology is not too large for a demonstration, and it expresses location information as well as species information. The initial plan was to add the needed qualities to DataONE data so it could relate to other Geospecies resources, add knowledge on locations and species from other RDF data sources, and then provide some contextual view of the data where users could drill down to more specific Geospecies knowledge and find their way to DataONE data.

The textual views of subject, predicate, and object links that are the current norm in viewing linked data may support very generic data browsing, but these lists can be daunting. What seemed useful about the Geospecies site is that it provides a hierarchical representation of the data, where the classifications are shown with links to related data on the Web. In the Geospecies site, from a family name, e.g., Lycosidae, you are shown links to other data on the Web like NCBI or Wikipedia. There are about nine links to other related data sites, and the links are just searches into those sites. The disconnect for LOD is that these links do not return information in RDF, which limits the generic browsing that structured RDF provides. On the other hand, an RDF mashup could be made by issuing a SPARQL query on the Geospecies site to retrieve the desired RDF and merging that with the DataONE data; a relevant hierarchical view could then be created to traverse the knowledge and links.
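
As a concrete sketch of that mashup idea, the snippet below pulls triples about a family from a Geospecies SPARQL endpoint and merges them with local DataONE RDF. The endpoint URL, the Lycosidae label filter, and the dataone.ttl file are all assumptions for illustration.

    # Sketch of the mashup: CONSTRUCT triples about Lycosidae from a Geospecies
    # SPARQL endpoint (assumed URL) and merge them with local DataONE RDF.
    from SPARQLWrapper import SPARQLWrapper, RDFXML
    from rdflib import Graph

    endpoint = SPARQLWrapper("http://lod.geospecies.org/sparql")  # assumed endpoint
    endpoint.setReturnFormat(RDFXML)
    endpoint.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        CONSTRUCT { ?s ?p ?o }
        WHERE {
            ?s rdfs:label ?label ;
               ?p ?o .
            FILTER regex(?label, "Lycosidae", "i")
        }
        LIMIT 500
    """)
    geospecies_graph = endpoint.queryAndConvert()  # CONSTRUCT results as an rdflib Graph

    mashup = Graph()
    mashup += geospecies_graph
    mashup.parse("dataone.ttl", format="turtle")   # local DataONE RDF (assumed file)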

Working with Hilmar, we decided that, given the extensive links and integration other datasets have with DBpedia, the RDF mashup would be an integration of DBpedia and DataONE RDF. The two questions I will focus on are: 1) how the RDF created could be turned into an integrated query over the DataONE network, and 2) how to query a cloud of data that would lead the search to DataONE data. Thus, the next demonstrations will focus on SPARQL queries over DataONE and DBpedia data.
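
A first cut at the second question could be a federated query: start from the local DataONE RDF and hop to DBpedia with a SPARQL 1.1 SERVICE clause. The d1: vocabulary and the dataone.ttl file are again hypothetical; the DBpedia endpoint and dbo:abstract property are real, and the join assumes the species links point at DBpedia resource URIs.

    # Sketch of a federated query over local DataONE RDF plus DBpedia, using
    # rdflib's SPARQL 1.1 SERVICE support. The d1: vocabulary is hypothetical;
    # the join assumes d1:aboutSpecies points at DBpedia resource URIs.
    from rdflib import Graph

    g = Graph()
    g.parse("dataone.ttl", format="turtle")  # local DataONE RDF (assumed file)

    query = """
        PREFIX d1:  <http://example.org/dataone/>
        PREFIX dbo: <http://dbpedia.org/ontology/>

        SELECT ?dataset ?abstract
        WHERE {
            ?dataset a d1:Dataset ;
                     d1:aboutSpecies ?species .
            SERVICE <http://dbpedia.org/sparql> {
                ?species dbo:abstract ?abstract .
                FILTER (lang(?abstract) = "en")
            }
        }
    """
    for row in g.query(query):
        print(row.dataset, row.abstract)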

The remainder of the week was spent describing more of the data in the DataONE RDF, that is, focusing more on the data content; preparing my site for the mashup and SPARQL queries of the DataONE data; and writing a more complete writeup of Use Case 1 based on the work done so far.

Observations

    It has been helpful to me to consider DataONE in terms of member nodes, coordinating nodes, and tools. Member nodes need to dereference URIs and negotiate content based on coordinating-node requests or other user requests (a sketch of such a dereferencing request appears after this list). Coordinating nodes need to perform an initial level of URI resolution by requesting the content from specific member nodes. DataONE tools would be useful in providing views and queries of RDF data. What I have not resolved is naming. Currently I am naming the DataONE resources based on the repository I am getting them from, and I don't see how the naming would change under the DataONE infrastructure, because it seems that member nodes will be accessible to other agents, not just coordinating nodes.
    What I had not really considered, due to my initial focus on context, was understanding the data itself. I was able to do all of this research by focusing on the metadata about the data provided by KNB and Dryad. The issue here is that the metadata lacks knowledge that exists in the data, and searches are limited by curation. Considering the data is important, though. For example, as I looked through the Dryad data files in more detail, it was interesting to see how some data carries information like county, museum, latitude, and longitude. With the proper knowledge infrastructure, capturing only longitude and latitude should be sufficient. The key here is proper infrastructure: scientists are capturing details of their work to ensure the data can be understood, but the infrastructure for relating information like longitude and latitude to other things, such as buildings, counties, or cities, is not available to them.
    There are a lot of tools out there for exploring the Semantic Web and Linked Data. This is good, but it is also a challenge. The tools are dynamic, often unstable (in part because of their reliance on a distributed Web), and sometimes proprietary. Relying on more generic technology, like RDF to express knowledge and SPARQL to query it, along with the ability to create RDF mashups, has helped me see how to integrate the DataONE data and work with it as a whole yet distributed dataset; still, the proliferation of vocabularies, browsers, and SPARQL endpoints makes the entire process of creating and using a unified dataset a challenge. I still think that focusing on contexts will hide much of this chaos from users. The question is how to hide the chaos from the software agents.
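
To make the member-node point from the first observation concrete, here is a sketch of dereferencing a resource URI with HTTP content negotiation, asking for RDF first and falling back to whatever the server returns. The URI is a hypothetical placeholder.

    # Sketch of dereferencing a member-node URI with content negotiation:
    # request RDF via the Accept header, fall back to the HTML view.
    import requests

    uri = "http://example.org/dataone/dataset/123"  # hypothetical resource URI

    response = requests.get(uri, headers={"Accept": "application/rdf+xml"})
    if "application/rdf+xml" in response.headers.get("Content-Type", ""):
        rdf_document = response.text   # machine-readable view for software agents
    else:
        html_document = response.text  # human-readable fallback for browsers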

Research Links

Several resources have been created as part of this research effort. To keep them readily available for reference alongside my weekly blog posts, I will be adding them here as links.
