Use Cases

This page considers the usefulness of this research effort. A complete picture of either Case 1 or Case 2 will take time to develop, but the use cases below show the progression. These use cases are emerging from the research I am documenting in my notes and from the work I am doing to evaluate the DataONE-to-LOD integration. By the end of this research, the use cases should give specific examples of how the DataONE cloud can link with a larger cloud and how this research effort can demonstrate that. Any input on specific examples that should be demonstrated, or questions that should be answered, will contribute to the final understandings of this research. Please provide comments, suggestions or questions.

From Week 5
Please see the presentation for Week 5. It discusses the progress of the use cases and this research’s findings thus far. Use case 1 has been demonstrated by generating RDF for several data sets in both KNB and Dryad; ORNL DAAC will follow. I have been able to show that I can take data from the two data repositories and generate RDF that can be browsed in two different viewers. Using the Zitgist DataViewer, I can show how a tool can arrange the RDF data that I created to make it more useful and to provide links to other related data in the LOD. Using the Tabulator, I have been able to pull up multiple datasets in one view and show how they are related based on their values and vocabulary, not necessarily because they have explicit links.

Use case 2 will follow using the findings from Use case 1, as discussed in the presentation. The presentation reviews the work from my research so far.

From Week 3
So far, this research has helped me understand that, regardless of what data is exposed, an important requirement for linking with any linked data cloud is that your data expose known internal and external relationships. As a result, with appropriate RDF, a user should be able to browse data and any related data both within the same system (e.g., the Dryad repository) and outside it (in a linked data cloud). The PowerPoint slides that I created and posted here describe a similar problem. Within the Dryad data repository, dryad.82 describes a publication. This Dryad record is highlighted within the repository because of its relationship to Treebase. In general, the record would be the result of a search for artiodactyls, hunting or extinction, based on the subject tags and scientific names associated with the data and used within the Dryad metadata. In addition, dryad.82 is related to dryad.83, which represents the record for the Dryad data, and dryad.83 is related to an Excel file called PriceGittleman_2007_append.xls. In this use case, I show how I went from manually searching the multiple disconnected XML pages of dryad.82, dryad.83 and PriceGittleman_2007_append.xls to an RDF representation that I could browse using an RDF browser (dataviewer.zitgist.com).

To understand the data in Dryad, I had to open the URL for dryad.82 and see that it has dryad.83 as a part. I then entered the URL for dryad.83 and saw that a Microsoft Excel file is related to it. When I opened that Excel file, I saw additional fields that might be useful for searching the content, e.g., Species. Initially, I built the RDF records for all three items. I could open these in an RDF browser, but there were clearly additional questions about the data: who are the people, how do I contact the publisher, what are the links between the Dryad entities and the file, and so on. I then modified the RDF to add links to ‘fake’ FOAF files as well as the links within the Dryad data. These initial capabilities expose one of the most important benefits of linked data: from one query, and with appropriate links, I can traverse data without having to manually open multiple windows or perform additional searches.
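As a rough sketch, the traversal described above can be modeled as following outgoing links over (subject, predicate, object) triples. The identifiers and the FOAF link below are illustrative stand-ins, not the actual Dryad RDF:

```python
# Illustrative triples modeling the Dryad example; the person and
# foaf:name entries stand in for the hypothetical 'fake' FOAF links.
triples = [
    ("dryad.82", "dcterms:hasPart", "dryad.83"),
    ("dryad.83", "dcterms:hasPart", "PriceGittleman_2007_append.xls"),
    ("dryad.82", "dc:creator", "person:1"),       # hypothetical FOAF link
    ("person:1", "foaf:name", "Example Author"),  # hypothetical
]

def traverse(start, triples):
    """Follow outgoing links from `start`, breadth-first, gathering
    everything reachable without opening each record by hand."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop(0)
        if node in seen:
            continue
        seen.add(node)
        for s, p, o in triples:
            if s == node:
                frontier.append(o)
    return seen

print(sorted(traverse("dryad.82", triples)))
```

From one starting point, the Excel file and the (hypothetical) creator are both reachable, which is the multi-window manual search collapsed into a single traversal.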

The use case and its solution also exposed some issues. For one, finding appropriate descriptions of data (i.e., descriptions that expose RDF) by referencing a URL is not simple. Where is this information published? One solution is to use the ODE spongers to pull in RDF data for linked URLs, without requiring them to be in RDF, but this does not guarantee what links are published, which could impede the browsability of the data. Furthermore, unless the linked URLs have content from specific data definitions, e.g., DC terms or FOAF, the content may not be very usable in ODE. In conclusion, I was able to produce browsable RDF content, but I was limited in how far out I could go from the Dryad data space. I was, however, able to show browsing within Dryad RDF content, e.g., dryad.82, dryad.83 and PriceGittleman_2007_append.xls, and to describe an initial link to outside data such as FOAF files. I was also able to RDFize some of the internal Excel data, i.e., Species, so it could be used in searches.

From Week 2
ODE allows ordering the browsable data by terms and categories, and it offers various views. I should be able to select some options and see relevant data from the RDF dataset. For example, I would like to query the triples that pertain to “hunting”, so I go into ODE, use the Find interface to enter “hunting”, and get nothing. I learned from this that ODE and the Virtuoso Find tool are having trouble seeing the RDF data that I created. In fact, I believe the RDF data is not being loaded at all, because I do not see any reference to Dryad RDF content in the Categories section. I do see dc:subject available for ordering, so my conclusion is that the ODE sponger is collecting as much data as it understands and making it available in its interface.
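For context, a Find over subject terms amounts to scanning triples for a keyword in the object position; if the data were loaded, "hunting" would match. A minimal sketch, with illustrative triples rather than actual Dryad content:

```python
# Illustrative triples; in the real data the subject terms come from
# the Dryad metadata (dc:subject tags).
triples = [
    ("dryad.82", "dc:subject", "hunting"),
    ("dryad.82", "dc:subject", "extinction"),
    ("dryad.83", "dc:title", "Data record for dryad.82"),
]

def find(keyword, triples):
    """Return triples whose object contains the keyword (case-insensitive),
    roughly what a Find over loaded RDF should do."""
    return [(s, p, o) for s, p, o in triples if keyword in o.lower()]

print(find("hunting", triples))
```

An empty result from loaded data containing these subjects would indicate the store never ingested the triples, which matches the missing-Categories symptom above.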

In a very technical sense, once I am able to obtain data from the three DataONE repositories and offer it as RDF to a cloud, I should, at the least, be able to view the RDF. I tried to use a Virtuoso SPARQL tool to query the RDF and obtain all triples, e.g., SELECT * WHERE { ?a ?b ?c . }, but this does not even return the 95 triples that the dataset has. To ensure that there are no obvious errors in the RDF I have created, I am using the W3C RDF Validator to inspect the Dryad triples and confirm that the RDF structure is correct. I also validate RDF queries against the Dryad RDF data in the Twinkle SPARQL tool. The results of these queries are in fact the triples I created; unfortunately, this first phase of creating ‘dumb’ RDF is not sufficient for ODE.
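For reference, SELECT * WHERE { ?a ?b ?c . } has three unconstrained variables, so it should match every triple in the store; getting back fewer than the 95 known triples points at a loading problem, not the query. A plain-Python sketch of that pattern matching, over an illustrative two-triple store:

```python
# Illustrative store; a real Virtuoso instance would hold the 95 Dryad triples.
store = [
    ("dryad.82", "dcterms:hasPart", "dryad.83"),
    ("dryad.82", "dc:subject", "hunting"),
]

def match(pattern, store):
    """Match a (s, p, o) pattern against the store; None plays the role
    of an unbound SPARQL variable and matches anything."""
    s, p, o = pattern
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# (None, None, None) is the analogue of ?a ?b ?c: every triple matches.
print(len(match((None, None, None), store)))
```

With all three positions unbound, the result set is the whole store, which is why an incomplete result from the real tool implicates ingestion rather than the SPARQL itself.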

From Week 1
There are two general cases that I want to develop throughout this research effort:
General Case 1
A user searches their local site for data and is able to find that data with links to a bigger cloud, e.g., DataONE data linking to related GenBank records. For this effort, I will grab specific data from the three DataONE repositories (ORNL DAAC, KNB and Dryad) and build DataONE RDF data sets. A user should be able to ask questions of the DataONE RDF and see information from the internal data that links to other data in the cloud.
General Case 2
A user searches an external site for data and is shown links that relate to DataONE data, e.g., GenBank records linking back to DataONE data. For this effort, I will need to find other entities in the cloud that point to DataONE data, specifically to a data set that is documented in the DataONE RDF data sets. Keep in mind that I currently have little control over what data is loaded into external sites, i.e., the cloud, aside from possibly creating some ‘fake’ data, which is unlikely. My current strategy is to build some simulated RDF of related data and link it back. If I get sufficient community feedback, I may be able to find existing references to DataONE data that could lead a query back. This may require adding data sets beyond those already selected for this research effort.
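The simulated-RDF strategy can be sketched as a reverse-link lookup: a simulated external record references a DataONE record, and the question "what out there points at this data set?" becomes a match on the object position. All identifiers below are hypothetical placeholders:

```python
# Simulated external triples; the GenBank-style URIs are hypothetical
# stand-ins, since real inbound links are not under my control.
external = [
    ("genbank:record-1", "dcterms:references", "dataone:dryad.82"),  # simulated
    ("genbank:record-2", "dcterms:references", "dataone:knb.456"),   # simulated
]

def inbound_links(target, triples):
    """Return external subjects whose links point at `target`,
    i.e., the General Case 2 direction: cloud -> DataONE."""
    return [s for s, p, o in triples if o == target]

print(inbound_links("dataone:dryad.82", external))
```

Case 1 follows links outward from DataONE subjects; Case 2 is the same triple store read in the opposite direction, which is why the simulated data only needs to exist somewhere the query can reach.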
