Week Two Update

Week two goals

This second week of LOD4DataONE was focused on writing code to extract data from datasets on the three repositories, creating RDF, and loading the RDF triples into an RDF explorer. Initially I intended to apply this process to all three repositories, but realistically I could only work the process through to the end with the Dryad datasets. This was not a surprise; the three repositories work quite differently with respect to getting data and metadata out. I was able to focus on using the OAI-PMH interface to extract metadata and then use site-specific tools to extract the data and RDFize it (create RDF triples). A sample RDF file for the Dryad data can be found at http://rio.cs.utep.edu/ciserver/ciprojects/udata/dryadRDFWeek2.rdf. I chose this location for the data because I have APIs to upload the content automatically, making it easier to modify, view, and share the results.

I was able to open the RDF data file using the OpenLink Data Explorer (ODE) Add-on in Firefox, which uses URIBurner as I had originally planned. ODE uses Virtuoso, a multi-model data server that works with relational, RDF, and XML data. Given a URL, it will attempt to generate RDF via a tool called a 'sponger'. The sponger does not require that a URL page be expressed in RDF; it tries to make sense of the page in order to create RDF triples. The week ended with comparing the RDF I created to the RDF created by the RDFizer in ODE and assessing the overall process with a use case. The use case can be found on the Use Cases page for this research effort.
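
To make the extract-and-RDFize step concrete, here is a minimal sketch in Python of how a single Dryad record could be harvested over OAI-PMH and turned into triples. The endpoint URL, record identifier, and handle shown are illustrative placeholders, not values taken from the project code.

    # Minimal sketch: harvest one record's Dublin Core metadata over OAI-PMH
    # and turn each element into an RDF triple. The endpoint, identifier, and
    # handle below are illustrative placeholders, not the project's values.
    import requests
    import xml.etree.ElementTree as ET
    from rdflib import Graph, Literal, Namespace, URIRef

    DC = Namespace("http://purl.org/dc/elements/1.1/")
    DC_TAG = "{http://purl.org/dc/elements/1.1/}"

    OAI_BASE = "http://datadryad.org/oai/request"      # assumed OAI-PMH endpoint
    RECORD_ID = "oai:datadryad.org:10255/dryad.20"     # hypothetical record id

    # 1. Pull the oai_dc metadata for one record.
    resp = requests.get(OAI_BASE, params={
        "verb": "GetRecord",
        "metadataPrefix": "oai_dc",
        "identifier": RECORD_ID,
    })
    root = ET.fromstring(resp.content)

    # 2. Turn every Dublin Core element into a triple about the record's URI.
    graph = Graph()
    subject = URIRef("http://hdl.handle.net/10255/dryad.20")  # placeholder handle
    for elem in root.iter():
        if elem.tag.startswith(DC_TAG) and elem.text:
            predicate = DC[elem.tag[len(DC_TAG):]]
            graph.add((subject, predicate, Literal(elem.text.strip())))

    # 3. Serialize to RDF/XML, the format loaded into the RDF explorer.
    graph.serialize("dryad_record.rdf", format="xml")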

Results

The RDFizer in ODE was able to pull a lot of information out of the Dryad metadata pages, and I could see relevant data in the Categories. In total the ODE sponger, the internal ODE tool that extracts all it can from URL pages, extracted 1085 triples. When I ran certain scenarios, e.g., a search for "hunting", I was able to find several relevant records. As a first pass this might be acceptable, but I am not sure exactly what IS in there from the default Dryad metadata page. Considering data misuse, it seems important to provide relevant data triples and to ensure that unexpected data is not exposed in RDF to the ODE sponger.
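
To get a feel for what a keyword scenario like the "hunting" search actually matches, a simple SPARQL filter can be run over the triples outside of ODE. This is only a sketch, assuming the sponged output has been saved locally as RDF/XML under a hypothetical filename; it is not how ODE itself implements search.

    # Sketch: find triples whose literal values mention "hunting", assuming the
    # sponged triples were exported to a local RDF/XML file (name is hypothetical).
    from rdflib import Graph

    g = Graph()
    g.parse("dryad_sponged.rdf", format="xml")

    query = """
    SELECT ?s ?p ?o WHERE {
      ?s ?p ?o .
      FILTER(isLiteral(?o) && CONTAINS(LCASE(STR(?o)), "hunting"))
    }
    """
    for s, p, o in g.query(query):
        print(s, p, o)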

My RDFizing of the Dryad data generates an initial set of 95 triples. There were issues with 2 of the 6 datasets, so I focused on extracting the 4 datasets that worked in order to complete this week's goals. When I load this RDF in ODE, ODE does not seem to be able to read it. Further testing showed that the RDF I generated had no errors according to the W3C RDF Validator, and I was able to query the data in a separate SPARQL tool called Twinkle SPARQL Query, which uses a different RDF triplestore. In the end, by loading the RDF file with the online Virtuoso Query Tool, I determined that Virtuoso can make no use of the RDF I have generated: I cannot run even a simple query, e.g., select * where { ?a ?b ?c . }, which should return all triples.
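
A similar sanity check can be reproduced outside Virtuoso by loading the generated file into rdflib and running the same catch-all query. This is a sketch; the filename assumes a local copy of the sample file linked above.

    # Sketch: load the generated RDF and run the catch-all query that Virtuoso
    # would not answer. The filename assumes a local copy of the sample file.
    from rdflib import Graph

    graph = Graph()
    graph.parse("dryadRDFWeek2.rdf", format="xml")

    results = graph.query("SELECT * WHERE { ?a ?b ?c . }")
    print(len(results), "triples")
    for a, b, c in results:
        print(a, b, c)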

Observations from week two:

  1. In retrospect, it helps to have stepped through the process with one data repository, i.e., Dryad.  That learning process should help me with the other two.  Moreover, Dryad's process made accessing the data I needed for this research easier.  For example, certain datasets that I selected through the KNB search tool were not so easy to find using the Metacat APIs, and the same was true for ORNL-DAAC.  On the other hand, I found it simpler to search Dryad's OAI-PMH interface, given the examples on their website, which in turn assured that I could find those datasets.  Since their examples came with links, I could also compare results.  Now that I have a process from data repository to OpenLink browser ironed out for the Dryad datasets, I can focus on getting the remaining 6 datasets from KNB and ORNL-DAAC.
  2. This first phase of RDF extraction was the 'dumb' phase: get data and produce valid RDF.  Nevertheless, it was still important to understand the data.  As I combed through the datasets to understand what should be returned and what I should RDFize, I saw key points for linking the internal Dryad data, for example, using dc:relation.haspart values to go get the RDF triples for the related handles (a small sketch of this idea appears after this list).  By the same token, I learned that the interface and the API do not always coincide.  For example, when viewing the full metadata for a record on Dryad, dc:identifier.uri holds the Web-accessible handle for a Dryad resource.  I initially used this to reference the URI in the 'about' field.  When I dumped the Dryad record programmatically, I found that the value was really in the dc:identifier field and that there were many dc:identifier fields.  Thus, I could not use it.  This affected the first RDF dataset, but it was not a major issue; it just changed how I determined the resource's URI.
  3. It is great that there are tools to help build and use the Semantic Web, e.g., the linked data views in ODE.  Unfortunately, ODE provides a lot of information that does not seem very useful to non-technologists; even as a technologist, I found much of it unclear.  This is expected, though: ODE and other tools that understand and leverage semantic data in RDF are only beginning to grasp what is needed to make the Semantic Web useful.  It seems clear that not only must DataONE understand how to provide their data in RDF, with appropriate internal as well as external links, DataONE will also need to customize their Semantic Explorer views.  In my comparison, I see where appropriate 'spongers' must be written to handle the client data extraction, and data views must be created to support appropriate meaning and context in the data, making the data more understandable and useful to scientists.  The nice thing is that with the Firefox Add-on concept and the ability to build 'spongers', the whole idea seems quite doable given the current OpenLink effort.  Note: ODE is also available for Google Chrome and Safari.
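
Related to observation 2 above, here is a small sketch of how dc:relation values could be followed to pull in the triples for a record's related parts. The handle pattern and the fetch_record helper are illustrative assumptions; fetch_record stands in for a call like the OAI-PMH harvest sketched earlier.

    # Sketch of following part/whole links: collect dc:relation values that look
    # like handles and merge in the graphs for those related records.
    # The handle prefix test and fetch_record helper are illustrative assumptions.
    from rdflib import Graph, Namespace

    DC = Namespace("http://purl.org/dc/elements/1.1/")

    def related_handles(graph):
        """Return dc:relation values that look like resolvable handle URIs."""
        return [str(obj) for obj in graph.objects(None, DC.relation)
                if str(obj).startswith("http://hdl.handle.net/")]

    def harvest_with_parts(start_graph, fetch_record):
        """Merge a record's graph with the graphs of every related record.

        fetch_record is a caller-supplied function (e.g. wrapping the OAI-PMH
        GetRecord call sketched earlier) that returns an rdflib Graph for a
        given handle URI.
        """
        merged = Graph()
        merged += start_graph
        for handle in related_handles(start_graph):
            merged += fetch_record(handle)
        return merged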

2 Replies to “Week Two Update”

  1. I was interested in your comment about the divergence between the Dryad API and the web view of the metadata. If you encounter any such discrepancies that may need fixing, please don’t hesitate to report them to help@datadryad.org so we have a record in the queue to investigate when time permits.

    • Thanks. I will look into why I think this is. One thing I believe is happening is that some fields, e.g., dwc:ScientificName, are not being converted to a dc: counterpart. The other may be a lack of experience on my part; I don’t see how to get the specializations of dc:relation or dc:identifier, for example. I will be posting a question about that on dryad-dev this week. Maybe someone else has ideas.
