{"id":860,"date":"2011-06-18T08:23:10","date_gmt":"2011-06-18T14:23:10","guid":{"rendered":"http:\/\/notebooks.dataone.org\/lod4dataone\/?p=59"},"modified":"2013-05-17T20:43:28","modified_gmt":"2013-05-17T20:43:28","slug":"week-two-update","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/linked-data\/week-two-update\/","title":{"rendered":"Week Two Update"},"content":{"rendered":"

Week two goals<\/h2>\n

The second week of LOD4DataONE focused on writing code to extract data from datasets in the three repositories, creating RDF, and loading the RDF triples into an RDF explorer. Initially I planned to apply this process to all three repositories, but realistically I could only work it through to the end with the Dryad datasets. This is not a surprise, really: the three repositories work quite differently with respect to getting data and metadata out. I focused on using the OAI-PMH interface to extract metadata, then using site-specific tools to extract data and RDFize it (create RDF triples). A sample RDF file for the Dryad data can be found at http:\/\/rio.cs.utep.edu\/ciserver\/ciprojects\/udata\/dryadRDFWeek2.rdf<\/a>. I chose this location because I have APIs to upload the content automatically, making it easier to modify, view and share the results. I was able to open the RDF data file using the OpenLink Data Explorer (ODE) add-on in Firefox, which uses URIBurner as I had originally planned. ODE uses Virtuoso, a multi-model data server that works with relational, RDF and XML data. Given a URL, it attempts to generate RDF via a tool called a ‘sponger’. The sponger does not require that a page be expressed in RDF; it tries to make sense of the page in order to create RDF triples. The week ended with comparing the RDF I created to the RDF created by the RDFizer in ODE, and assessing the overall process with a use case. The use case can be found on the Use Cases<\/a> page for this research effort.<\/p>\n
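The metadata-to-RDF step above can be sketched in a few lines of Python. This is an illustrative outline only, not the actual LOD4DataONE code: the oai_dc record, handle URIs, and title below are made up, and a real run would fetch records over the Dryad OAI-PMH interface instead of using an inline sample.

```python
# Sketch: turn one OAI-PMH Dublin Core record into N-Triples.
# The sample record and handle URIs are illustrative, not real Dryad output.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>http://hdl.handle.net/10255/dryad.20</dc:identifier>
  <dc:title>Example Dryad data package</dc:title>
  <dc:relation>http://hdl.handle.net/10255/dryad.21</dc:relation>
</oai_dc:dc>"""

def record_to_ntriples(xml_text):
    root = ET.fromstring(xml_text)
    # Use the first http identifier as the subject URI (a record may carry
    # several dc:identifier values, so picking one is a policy decision).
    ids = [e.text for e in root.findall(f"{{{DC}}}identifier")
           if e.text and e.text.startswith("http")]
    subject = ids[0]
    triples = []
    for elem in root:
        tag = elem.tag.split("}")[1]          # e.g. "title"
        predicate = DC + tag
        obj = elem.text.strip()
        if obj.startswith("http"):
            # URI-valued fields (relations, identifiers) become resource links.
            triples.append(f"<{subject}> <{predicate}> <{obj}> .")
        else:
            triples.append(f'<{subject}> <{predicate}> "{obj}" .')
    return triples

for t in record_to_ntriples(SAMPLE):
    print(t)
```

The URI-vs-literal branch is the part that creates linkable data: relation handles emitted as resource URIs are what let an explorer follow links between Dryad records.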

Results<\/h2>\n

The RDFizer in ODE was able to pull a lot of information out of the Dryad metadata pages, and I could see relevant data in the Categories. In total the ODE sponger, the internal ODE tool that extracts all it can from URL pages, extracted 1085 triples. When I ran certain scenarios, e.g., a search for “hunting”, I was able to find several relevant records. As a first pass this might be acceptable, but I am not sure what exactly is in there from the default Dryad metadata page. Considering data misuse, it seems important to provide relevant data triples and to ensure that unexpected data is not exposed in RDF to the ODE sponger.<\/p>\n
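The “hunting” scenario is essentially a keyword match over the literal values of the sponged triples. A minimal sketch of that kind of check, using made-up stand-in triples (none of these URIs or titles are real sponger output):

```python
# Stand-in triples as (subject, predicate, object) tuples; in practice these
# would come from the ~1085 triples the ODE sponger extracted.
triples = [
    ("http://example.org/r1", "http://purl.org/dc/elements/1.1/title",
     "Hunting increases vigilance levels in roe deer"),
    ("http://example.org/r2", "http://purl.org/dc/elements/1.1/title",
     "Data from: coral reef survey"),
]

def search(triples, keyword):
    # Case-insensitive match against the object (literal) position only.
    kw = keyword.lower()
    return [t for t in triples if kw in t[2].lower()]

hits = search(triples, "hunting")
print(len(hits))  # 1
```

A check like this also helps with the misuse concern: scanning every literal the sponger produced is one way to audit what has actually been exposed.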

My RDFizing of the Dryad data generates an initial set of 95 triples. There were issues with 2 of the 6 datasets, so I focused on extracting the other 4 successfully to complete this week’s goals. When I load this RDF in ODE, ODE does not seem to be able to read it. Further testing showed that the RDF I generated had no errors in the W3C RDF Validator<\/a>, and I was able to query the data in a separate SPARQL query tool called Twinkle SPARQL Query<\/a>, which uses a different RDF triplestore. In the end, by using the online Virtuoso Query Tool<\/a> to load the RDF file, I determined that Virtuoso can make no use of the RDF I have generated: I cannot run even a simple query, e.g., select * where { ?a ?b ?c . }, which should return all triples.<\/p>\n
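The select * where { ?a ?b ?c . } query is just a request for every triple in the store, so the same sanity check can be done locally before involving Virtuoso. The sketch below mimics it by parsing N-Triples lines and returning every (subject, predicate, object) row; the sample data is illustrative rather than the actual week-two RDF, and the regex is deliberately simplified (no blank nodes, datatypes, or language tags).

```python
# Minimal stand-in for "select * where { ?a ?b ?c . }": parse an N-Triples
# string and confirm every triple is readable. Sample data is illustrative.
import re

NT = """<http://hdl.handle.net/10255/dryad.20> <http://purl.org/dc/elements/1.1/title> "Example package" .
<http://hdl.handle.net/10255/dryad.20> <http://purl.org/dc/elements/1.1/relation> <http://hdl.handle.net/10255/dryad.21> .
"""

# Simplified pattern: <subject> <predicate> object .
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+(.+?)\s+\.$')

def all_triples(ntriples_text):
    rows = []
    for line in ntriples_text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = TRIPLE.match(line)
        if m is None:
            # A malformed line here would explain a triplestore rejecting
            # the file even though a lenient validator accepted it.
            raise ValueError(f"unparseable line: {line}")
        rows.append(m.groups())
    return rows

print(len(all_triples(NT)))  # 2
```

When one store accepts a file and another silently yields nothing, a line-by-line parse like this narrows down whether the problem is the data's syntax or the store's loader.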

Observations from week two:<\/h2>\n
    \n
  1. In retrospect, it helps to have stepped through the process with one data repository, i.e., Dryad. That learning process should help me with the other two. Nevertheless, the process Dryad has in place made accessing the data I needed for this research easier. For example, certain datasets that I selected through the KNB search tool were not so easy to find using the Metacat APIs; the same was true with ORNL-DAAC. On the other hand, I found it simpler to search Dryad’s OAI-PMH tool, given the examples on their website, which in turn assured that I could find those datasets. Since their examples came with links, I could also compare results. Now that I have a process from data repository to OpenLink browser ironed out for the Dryad datasets, I can focus on getting the remaining 6 datasets from KNB and ORNL-DAAC.<\/li>\n
  2. This first phase of RDF extraction was kind of the ‘dumb’ phase: get data and produce valid RDF. Nevertheless, it was still important to understand the data. As I combed through the datasets to understand what should be returned and what I should RDFize, I saw key points for linking the internal Dryad data, for example, using dc:relation.haspart values to go get the RDF triples for the related handles. By the same token, I learned that the interface and the API do not always coincide. For example, when viewing the full metadata for a record on Dryad, dc:identifier.uri holds the Web-accessible handle for a Dryad resource. I initially used this to reference the URI in the ‘about’ field. When I dumped the Dryad record programmatically, I found that the value was really in the dc:identifier field and that there were many dc:identifier fields. Thus, I could not use it; this affected the first RDF dataset, but it was not a major issue: it just changed how I determined the resource’s URI.<\/li>\n
  3. It is great that there are tools to help build and use the Semantic Web, e.g., the linked data seen in ODE. Unfortunately, ODE provides a lot of information that does not seem too useful to non-technologists; I, as a technologist, found much of it unclear. This is expected, though. ODE and other tools that understand and leverage semantic data in RDF are only beginning to grasp what is needed to make the Semantic Web useful. It seems clear that not only must DataONE understand how to provide its data in RDF, with appropriate internal as well as external links, it will also need to customize its Semantic Explorer views. From my comparison, I see that appropriate ‘spongers’ must be written to handle the client data extraction, and data views must be created to support appropriate meaning and context, making the data more understandable and useful to scientists. The nice thing is that with the Firefox add-on concept and the ability to build ‘spongers’, the whole idea seems quite doable given the current OpenLink effort. Note: ODE is also available on Google Chrome and Safari.<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"

    Week two goals This second week for LOD4DataONE was focused on writing code to extract data from datasets on the three repositories, creating RDF and loading the RDF triples in an RDF explorer.\u00a0 Initially I was focused on applying this process to all three repositories but realistically I could only Continue reading Week Two Update<\/span>→<\/span><\/a><\/p>\n","protected":false},"author":22,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[113],"tags":[],"_links":{"self":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/860"}],"collection":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/users\/22"}],"replies":[{"embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/comments?post=860"}],"version-history":[{"count":3,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/860\/revisions"}],"predecessor-version":[{"id":1110,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/860\/revisions\/1110"}],"wp:attachment":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/media?parent=860"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/categories?post=860"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/tags?post=860"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}