Week Six Goals
I had a few goals for this week. First, now that I can see some DataONE data in both the Zitgist and Tabulator browsers, I felt I could close out Use Case 1; to do so, I decided to create a PowerPoint demo of my work so far. I also started focusing on Use Case 2, which requires that I 1) understand what vocabulary to use to work my way from related DataONE data back to the datasets I chose and 2) select a platform to display the results on. To me, this part is the most challenging, but it is also the most important.
Results
This week I completed a PowerPoint presentation covering the work I have done so far. In it I discuss the KNB and Dryad datasets, the vocabularies used, and the RDF browsers I am working with. Although the presentation needs a second pass, it is a good start. Preparing it also helped me iron out details in the code that I had roughly worked through in the weeks leading up to week 6. The code can be found on GitHub in the LOD4DataONE repository. I have also generated a Javadoc, solely to give an overview of the code. The Javadoc can be found here.
I am grabbing ORNL DAAC data from FTP files. This is not optimal; I was hoping to get more metadata to go with the files I had. I only had a little time to focus on this last week, so in week 7 I will work more on grabbing metadata along with the data. On the positive side, the data I chose includes location information, so I expect it will be useful for Use Case 2. A sketch of the FTP retrieval follows.
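To make the retrieval step concrete, here is a minimal sketch assuming only the JDK: Java's built-in ftp: protocol handler can stream a file from an FTP server. The URL below is a placeholder, not an actual ORNL DAAC path.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class FtpFetchSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical path; the real ORNL DAAC FTP layout differs, and any
        // companion metadata files would be fetched the same way.
        URL url = new URL("ftp://daac.ornl.gov/data/example/site_data.txt");
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String line;
        while ((line = in.readLine()) != null) {
            // Each line would be parsed here and turned into RDF statements.
            System.out.println(line);
        }
        in.close();
    }
}
```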
The one unifying axis of integration I will consider for Use Case 2 is location. I still have some work to do: Dryad data, for example, uses a textual location description that I will need to convert to the WGS84 format, while the ORNL DAAC data has explicit longitude and latitude and the KNB data has longitude and latitude defined in the EML metadata. I should then be able to produce a single view where a region is selected and the matching DataONE data is found. Currently this view is not very fluid: Zitgist shows the location separately for each data record, and Tabulator requires that I select each dataset separately as an individual query. The lack of more useful views of linked data is actually one observation that Jane Greenberg identified in her ALA conference visit.
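As an illustration of the target form, the sketch below uses Jena to attach WGS84 latitude and longitude properties to a dataset resource; the resource URI and coordinate values are placeholders, not actual DataONE records.

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

public class Wgs84Sketch {
    // W3C Basic Geo (WGS84 lat/long) vocabulary namespace.
    static final String GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#";

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("geo", GEO);

        Property lat = model.createProperty(GEO, "lat");
        Property lng = model.createProperty(GEO, "long");

        // Hypothetical record; the coordinates would come from EML bounding
        // coordinates or a converted Dryad location description.
        Resource record = model.createResource("http://example.org/d1/record1");
        record.addProperty(lat, model.createTypedLiteral(35.93));
        record.addProperty(lng, model.createTypedLiteral(-84.31));

        model.write(System.out, "TURTLE");
    }
}
```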
Another view worth considering is a taxonomy view that shows species and then links them to the DataONE data. The species data is not programmatically available to me, so I would need to inject that information for the sake of demonstration. Deciding on a more useful view, and on the vocabulary that will support access to all this linked data, will determine the specific steps for implementing Use Case 2.
I have a few development platform options for implementing Use Case 2. I will use the datasets generated in Use Case 1, although I will need to enhance them and inject demonstrational RDF data. I am currently editing the Tabulator code to create a new data view, but if I cannot work within the framework of an existing browser such as Tabulator, I will fall back on Java and the Jena Semantic Web Framework to create a specific view.
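If I do fall back on Jena, a region-based view could start from a SPARQL query like the one sketched below; the bounding-box values are arbitrary, and the xsd:double casts are there in case the coordinates are stored as plain strings.

```java
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class RegionViewSketch {
    public static void main(String[] args) {
        // Load the Use Case 1 RDF from its published URL.
        Model model = ModelFactory.createDefaultModel();
        model.read("http://rio.cs.utep.edu/ciserver/ciprojects/sdata/DryadData.rdf");

        // Select every resource whose WGS84 point falls in a bounding box.
        String q =
            "PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>\n" +
            "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n" +
            "SELECT ?s ?lat ?long WHERE {\n" +
            "  ?s geo:lat ?lat ; geo:long ?long .\n" +
            "  FILTER (xsd:double(?lat)  > 30  && xsd:double(?lat)  < 40 &&\n" +
            "          xsd:double(?long) > -90 && xsd:double(?long) < -80)\n" +
            "}";

        QueryExecution qe = QueryExecutionFactory.create(QueryFactory.create(q), model);
        try {
            ResultSetFormatter.out(System.out, qe.execSelect());
        } finally {
            qe.close();
        }
    }
}
```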
To make sure I did not lose any details from the suggestions and input others have shared with me (thank you Todd, Ryan, Jane, and Hilmar), I went through various emails and websites related to this research. I added notes from these emails to the LOD4DataONE notes page and considered how they will affect the final phase of this research.
Observations
- As I try to understand relevant vocabularies and review various LOD sites and emails, I see that vocabulary is important for linked data but is not an easy answer for the technologists trying to align data. Originally I was very concerned with which vocabulary I should use; although that is an important question, I now see that the answer is directly affected by what is to be viewed. For example, Tabulator looks for spatial properties in order to display map data. This means it is not enough to give structure to data in a linked data cloud; it is also important to choose vocabularies that enable understanding through relevant viewers.
- Other relevant Linked Data and RDF research efforts should, and will, be considered in this work. The Dryad Oxford group, for example, is generating RDF from Dryad data. As a result of these email exchanges, and to integrate ideas, I will add the DataCite vocabulary to the DataONE RDF I have created. Other ideas align nicely; for example, the Library Linked Data Incubator Group lists many use cases that choose certain vocabularies. DCTERMS, DCMI, SKOS, and FOAF are a few of the more common ones, and some are already used in the RDF data I have created. As I add data to complete this second use case, I will consider these vocabularies first in order to highlight the relevance to these different research efforts.
- One benefit of RDF is the ability to structure data in any way and then add properties. A key feature I found useful in both Zitgist and Tabulator was that, although I created my own type because I could not find an adequate type definition for the DataONE data relationships, the browsers could still leverage the dcterms, foaf, and wgs84 properties defined on my types. This meant I could exploit qualities of the specific objects I was creating to support DataONE; see the sketch after this list.
- One question people may have is how to become part of the published LOD cloud. I had this question too; originally I thought there was a requirement to publish the data itself, but I have since determined that what is published are the links, that is, the links from DataONE to other cloud entities. Please refer to the LOD documentation on how DataONE can be part of the LOD cloud.
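To illustrate the custom-type observation above, the sketch below defines a home-grown D1Dataset class (the namespace is hypothetical) whose instances still carry dcterms, foaf, and wgs84 properties that generic browsers such as Zitgist and Tabulator know how to render.

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.DCTerms;
import com.hp.hpl.jena.vocabulary.RDF;

public class CustomTypeSketch {
    static final String D1   = "http://example.org/lod4dataone#"; // hypothetical namespace
    static final String FOAF = "http://xmlns.com/foaf/0.1/";
    static final String GEO  = "http://www.w3.org/2003/01/geo/wgs84_pos#";

    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();

        // My own type, since no existing class captured the DataONE
        // dataset relationships.
        Resource d1Dataset = m.createResource(D1 + "D1Dataset");

        Resource rec = m.createResource(D1 + "rec1");
        rec.addProperty(RDF.type, d1Dataset);

        // Well-known properties that generic RDF browsers already render.
        rec.addProperty(DCTerms.title, "Example dataset title");
        rec.addProperty(m.createProperty(FOAF, "maker"), m.createResource(D1 + "creator1"));
        rec.addProperty(m.createProperty(GEO, "lat"), m.createTypedLiteral(35.93));
        rec.addProperty(m.createProperty(GEO, "long"), m.createTypedLiteral(-84.31));

        m.write(System.out, "TURTLE");
    }
}
```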
Hi Aída – could you post the URLs to the RDF you have generated for the ORNL DAAC datasets here too? Several from the DataONE Data Semantics and Integration WG were asking for that to be able to play with them.
Also, it occurs to me that we should archive earlier versions so that we can reproduce the incremental improvement. So perhaps the RDF datasets should be committed to the git repo as well? Or does the UTEP server support versioning?
Certainly, the KNB dataset RDF is:
The Dryad dataset RDF is: http://rio.cs.utep.edu/ciserver/ciprojects/sdata/DryadData.rdf
and these two datasets together are found in:
http://rio.cs.utep.edu/ciserver/ciprojects/sdata/DataONEData
Notice that this was a quick-and-dirty merge that does not consider a unified name structure.