Week Ten Update – Final Update

Week Ten Goals

The goals for week 10, the final week of the LOD4DataONE internship, were to finalize a SPARQL query demonstration using the RDF data. In addition, all code, documents, and RDF were to be collected on the GitHub project page. Links to the query page and GitHub are found below.

Results

The demonstration is a Webpage, found at http://manaus.cs.utep.edu/lod4d1, running on a Drupal server. The Webpage, implemented as a Drupal module, describes the project and the queries that relate to its two use cases. The queries search over the DataONE RDF created for this project as well as two remote stores: Data.gov and DBpedia. Answering the queries required synchronizing the data: I had to modify the RDF I had generated for the three repositories so that the expected RDF could be found with a single query. In some cases the results do not return everything I want, but this is due to the different ways the three repositories express their data. I would expect this with any effort to combine data; it is an inherent issue with multiple sources of data.
To support Use Case 1, I created a series of queries that show how to find data across the three DataONE member repositories. The resulting data can then be browsed in an RDF browser, as was done with the examples throughout this research.
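As a rough illustration only (not the project's actual code, file names, or vocabulary), a cross-repository query of this kind could be run with Apache Jena over a local copy of the generated RDF along these lines; the file names, the dcterms:title predicate, and the DBpedia species URI are placeholders:

```java
import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.RDFDataMgr;

public class UseCase1Sketch {
    public static void main(String[] args) {
        // Load the RDF generated for the three member repositories into one
        // model (file names are placeholders).
        Model model = ModelFactory.createDefaultModel();
        RDFDataMgr.read(model, "repository1.rdf");
        RDFDataMgr.read(model, "repository2.rdf");
        RDFDataMgr.read(model, "repository3.rdf");

        // A single query that finds datasets in any of the repositories that
        // link to the same DBpedia species resource (placeholder URI).
        String queryString =
            "PREFIX dcterms: <http://purl.org/dc/terms/>\n" +
            "SELECT ?dataset ?title WHERE {\n" +
            "  ?dataset dcterms:title ?title ;\n" +
            "           ?anyProperty <http://dbpedia.org/resource/Picea_mariana> .\n" +
            "}";

        Query query = QueryFactory.create(queryString);
        try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
            ResultSetFormatter.out(System.out, qe.execSelect(), query);
        }
    }
}
```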
To support Use Case 2, I created queries that extract data from DBpedia, Data.gov, and the DataONE repositories, creating a mashup of RDF that can then be browsed. One thing I had to consider when accessing the remote stores was whether to duplicate their RDF in the LOD4DataONE triple store for querying, or to limit the data to only what was needed for a particular mashup. Either way, generic queries against the remote stores generated too many triples for my process, so I had to cap them at some arbitrary number, e.g., 1000. As a result, I provided sample queries that extract more focused data, i.e., pulling in less RDF from the remote stores by constraining the results with a SPARQL FILTER rather than an arbitrary limit. It seems that as the data on the Semantic Web increases, configuring stores and queries will become the norm for accessing the immense amount of content that will be available.
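To make the contrast concrete, here is a sketch, again assuming Apache Jena; it compares a generic extraction from DBpedia's public endpoint capped by an arbitrary LIMIT with a focused extraction that uses a FILTER instead. The dbo:Species class and the label filter are illustrative choices, not the exact queries used in the demonstration:

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;

public class RemoteStoreSketch {
    static final String DBPEDIA_ENDPOINT = "http://dbpedia.org/sparql";

    public static void main(String[] args) {
        // Generic extraction: pulls far more triples than the mashup needs,
        // so it has to be capped at an arbitrary number of results.
        String generic =
            "PREFIX dbo: <http://dbpedia.org/ontology/>\n" +
            "CONSTRUCT { ?s ?p ?o }\n" +
            "WHERE { ?s a dbo:Species ; ?p ?o }\n" +
            "LIMIT 1000";

        // Focused extraction: a FILTER narrows the results to the data the
        // mashup actually needs, so no arbitrary cap is required.
        String focused =
            "PREFIX dbo: <http://dbpedia.org/ontology/>\n" +
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
            "CONSTRUCT { ?s ?p ?o }\n" +
            "WHERE {\n" +
            "  ?s a dbo:Species ; rdfs:label ?label ; ?p ?o .\n" +
            "  FILTER regex(str(?label), \"spruce\", \"i\")\n" +
            "}";

        for (String q : new String[] { generic, focused }) {
            try (QueryExecution qe =
                     QueryExecutionFactory.sparqlService(DBPEDIA_ENDPOINT, q)) {
                // execConstruct() returns a Jena Model that can be merged into
                // the local triple store or written out for browsing.
                System.out.println("Triples returned: " + qe.execConstruct().size());
            }
        }
    }
}
```

With the focused form, the amount of RDF pulled from the remote store stays manageable without an arbitrary cutoff.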
At the bottom of the demonstration Webpage are several queries that I tested over the last two weeks.
Finally, all PHP code for the Drupal module, Java code for the RDF generation, PowerPoint presentations, Javadoc, and RDF are available on the GitHub wiki, along with a README file.

Observations

  • It seemed difficult to manage changes to the RDF I was creating. As more information was needed, the RDF changed, and as a result I was changing code and RDF up until the last day of this internship. In these last two weeks, changes were often needed to relate and extract certain data for the queries, e.g., dbpedia:species was used to relate data to dbpedia.org. These changes made me wonder what I would have done if one file had changed out of sync with another, e.g., causing a link to be lost. It seems to me that versioning of RDF data, in particular linked RDF, will likely be a major concern for DataONE, if it is not already.
  • As I used the remote stores to extract data, one thing I noticed was inconsistencies across the stores that forced me to handle their RDF differently, for example, differences in supported SPARQL features or in the RDF result sets returned for the same query. In some cases I could resolve these issues with generic code, but in others I had to write store-specific code. As the Semantic Web grows, these idiosyncrasies will become bottlenecks to seamlessly integrating data.
  • Clearly, in terms of the Semantic Web and linked open data, I think that DataONE needs to focus on providing more data as RDF, something Dryad recently proposed to implement for their data. There are other structures that could be used; in general, DataONE would benefit from supporting common content negotiation techniques. The data also needs to be available at dereferenceable URIs.
  • There needs to be consistency in data access. One of my biggest bottlenecks was the different methods for accessing metadata and data from the different repositories. DataONE member nodes are working to provide similar access via a DataONE API. This may work across the DataONE entities, but it seems to limit access from outside of DataONE. In particular, for searching and linking data, SPARQL endpoints would help make DataONE data more openly available to more users AND software agents.
  • When capturing and/or curating data, DataONE could help users by providing mechanisms to embed links to related information or, to avoid duplication, links to the source of the information. For example, the soil respiration database published at ORNL-DAAC is actually a copy; at some point the authors were forced to make a copy rather than point to the source.
  • When accessing RDF, DataONE could help by providing contextual views of the data. For example, when Zitgist sees location vocabulary, it displays a map template showing the location data on a map. Zitgist has no additional knowledge of or ability to handle such data, so when a DataONE dataset has multiple locations, Zitgist gains no better understanding of it. The DataONE repositories have better knowledge of what the Earth data is about and could provide useful views for searching and viewing related data and metadata across all member node datasets.
  • DataONE could provide insight into useful vocabularies to support the management of data, datasets, and related publications about scientific data; these are the things managed at the three repositories. In this research I chose not to use Dublin Core to describe an article, so I created a new type; had I stayed with the Dublin Core vocabulary, I would have felt forced to choose dcmi:text to describe both an article and a file (a sketch of this choice appears after this list). By providing use cases, useful vocabularies, example implementations, and sample queries, more users will understand how to describe their data in a way that promotes interoperability. I found that my decisions were sometimes based on a default of what I had, uncertainty about what I found, or exhaustion from searching.
  • Finally, in terms of the data itself, DataONE could help define mechanisms for maintaining the context of the data when producing RDF. For example, documenting provenance in RDF already helps maintain a link between the RDF and the specific data it came from. There are already tools that convert ALL of a file's content to RDF triples. The implication, as noted in earlier conversations during the internship, is a loss of context, e.g., when only a subset of the data is returned or when a vocabulary is misunderstood. It seems to me that this is an unavoidable issue that DataONE will need to address.
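As a concrete illustration of the vocabulary point above, here is a minimal Jena sketch. The lod4d1: namespace and Article class are hypothetical stand-ins for the custom type created in this project, and dcmi:text is taken to refer to the DCMI Type term Text (http://purl.org/dc/dcmitype/Text); only the Dublin Core URIs are the standard ones.

```java
import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.DCTerms;
import org.apache.jena.vocabulary.RDF;

public class ArticleVocabularySketch {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        String lod4d1 = "http://example.org/lod4d1#";   // hypothetical namespace
        model.setNsPrefix("lod4d1", lod4d1);
        model.setNsPrefix("dcterms", DCTerms.getURI());

        Resource dcmiText = model.createResource("http://purl.org/dc/dcmitype/Text");

        // Option 1: stay with Dublin Core, typing both an article and a plain
        // data file as dcmitype:Text -- the two become indistinguishable by type.
        Resource article = model.createResource("http://example.org/article/1")
                .addProperty(RDF.type, dcmiText)
                .addProperty(DCTerms.title, "A journal article");
        Resource dataFile = model.createResource("http://example.org/file/1")
                .addProperty(RDF.type, dcmiText)
                .addProperty(DCTerms.title, "A plain-text data file");

        // Option 2: introduce a custom class for articles so queries can
        // distinguish publications from data files.
        article.addProperty(RDF.type, model.createResource(lod4d1 + "Article"));

        model.write(System.out, "TURTLE");
    }
}
```

The trade-off is the one described above: the custom class keeps the distinction between an article and a file, at the cost of using a vocabulary that no one else shares by default.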

Research Links
