Mentor Plan

LOD4DataONE Summer Intern Mentor Plan

Intern: Aída Gándara: a doctoral student from The Department of Computer Science at The University of Texas at El Paso and a research student at Cyber-ShARE.

Primary Mentor: Hilmar Lapp: from The National Evolutionary Synthesis Center (NESCent).

The Linked Open Data DataONE Summer Internship Mentor Plan below was conceived as a tentative plan for the duration of the internship. We expect this plan to evolve as the prototype changes and community ideas are incorporated. This plan will be updated weekly. An original version of this plan can be found on Google Docs.

Preliminary Research

Preliminary research will establish infrastructure for the automation process, including the programming platform, initial browsing technology, demo site, and the development libraries and tools that will be used to implement the prototype. Overall, this plan focuses on:

  • Extracting DataONE datasets from the Dryad, KNB and ORNL-DAAC repositories and exposing them as RDF,
  • Reconciling data within the datasets to authoritative sources, considering common sources, and
  • Merging the DataONE RDF data into a larger data cloud like the Linked Open Data Cloud.

Preliminary Research Update
Java seems to be the most appropriate language in which to write the prototype. URIBurner seems to be the best candidate for loading the RDF and using it as a browser of the data. I will use my research website to house the browser, mainly because I have the permissions I need to set this up for demoing the results. A link will be placed on the LOD4DataONE DataONE notebook to make it easy to access the prototype. In addition to Java, Jena will be used to build the RDF, along with the libraries needed to access the three repositories. For now, this only seems to be the Metacat libraries used to access data on the KNB repository. For Dryad I will use the OAI-PMH Web services and the METS page to obtain data. For ORNL-DAAC it looks like I will have to browse pages from their FTP data repository. The LOD4DataONE GitHub repository will be used to store all software created for this project.
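The prototype itself will be written in Java with Jena, but the Dryad harvesting step above can be sketched briefly. The snippet below is a minimal stdlib-only Python illustration of parsing an OAI-PMH ListRecords response; the canned response, identifier, and title are illustrative, not actual Dryad records.

```python
# Minimal sketch of the OAI-PMH harvesting idea for Dryad: parse a
# (canned, illustrative) ListRecords response and pull out each record's
# identifier and Dublin Core title. No network access; stdlib only.
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:datadryad.org:10255/dryad.20</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example Dryad data package</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def parse_list_records(xml_text):
    """Return (identifier, title) pairs from an OAI-PMH ListRecords response."""
    root = ET.fromstring(xml_text)
    out = []
    for rec in root.iter(OAI + "record"):
        ident = rec.find(OAI + "header/" + OAI + "identifier").text
        title = rec.find(".//" + DC + "title").text
        out.append((ident, title))
    return out

print(parse_list_records(SAMPLE))
```

A real harvester would page through resumption tokens and fetch from the live endpoint, but the metadata-extraction step looks essentially like this.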

Week 1 (Jun 6th – Jun 10th )

Project Activities: Focus on selecting data and RDF structure for DataONE dataset repositories (KNB, Dryad, ORNL-DAAC)

  • Identify ‘best’ data to extract from repositories.
  • Identify RDF vocabularies/structures (either ad hoc or predefined) for the different data. Currently not necessarily considering the same vocabulary for all.
  • Define use cases and an extraction/RDF plan for the different datasets, e.g., which fields, why and why not.
  • Update LOD4DataONE DataONE notebook
  • Email interns & mentors

Development opportunities for the intern: Identifying data knowledge contacts and learning how to apply RDF to scientific data in the DataONE repositories
Expected Outcomes: List of initial set of datasets (at least 3 from each), vocabularies, and use-cases that will be included in the prototype.
Completed?: Yes

Week 2 (Jun 13th -Jun 17th)

Project Activities: Focus on extracting data and generating RDF

  • Automated extraction of datasets from the 3 repositories. ==> Update: focused just on Dryad due to differences in extraction tools.
  • Generate RDF based on recommendations in week 1. ==> Update: created RDF.
  • Make RDF browsable through an RDF browser. ==> Update: not browsable in the first browser used (ODE).
  • Establish a site for community access to the browser/RDF. ==> Update: placing my content on a Drupal server I have.
  • Update LOD4DataONE DataONE notebook
  • Email interns & mentors

Development opportunities for the intern: Understand how to extract from different scientific source repositories as well as the challenges in making scientific data browsable.
Expected Outcomes: Extracted data accessible in RDF and browsable via the web. ==> Update: extracted but not browsable; browsers are not as straightforward in the RDF world.
Complete?: Partially ==> Update: All data can be browsed using the default OpenLink Data Explorer add-on, but not using the RDF I created. I created RDF only for Dryad, and ODE does not load it correctly.
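The "generate RDF" step for Week 2 is conceptually simple even though the prototype does it in Java with Jena. As a minimal stdlib-only sketch, the function below serializes one extracted record's Dublin Core fields as Turtle; the URI and field values are illustrative.

```python
# Sketch of the Week 2 step: turning extracted metadata fields into RDF.
# The prototype uses Java/Jena; this stdlib-only version just serializes a
# record as Turtle using Dublin Core terms. URI and fields are illustrative.

def record_to_turtle(uri, fields):
    """Serialize a dict of Dublin Core field/value pairs as Turtle."""
    lines = ["@prefix dc: <http://purl.org/dc/elements/1.1/> ."]
    props = "; ".join('dc:%s "%s"' % (k, v) for k, v in sorted(fields.items()))
    lines.append("<%s> %s ." % (uri, props))
    return "\n".join(lines)

ttl = record_to_turtle(
    "http://datadryad.org/resource/doi:10.5061/dryad.20",  # illustrative URI
    {"title": "Example data package", "creator": "A. Researcher"},
)
print(ttl)
```

Jena handles escaping, datatypes, and serialization formats properly; the point here is only the shape of the mapping from extracted fields to triples.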

Week 3 (Jun 20th – Jun 24th)

Project Activities: Evaluate research effort and data with the DataONE community

  • Iron out issues with ODE to make the created RDF browsable. ==> Update: there are supported spongers for ODE. The RDF I created uses a Dryad-based DCType definition, which ODE does not recognize. I was able to use zitgist.com, which is a generic RDF browser.
  • Evaluate what RDF is created by the sponger in ODE. ==> Update: ODE seems to be tightly integrated with specific sites; not sure what to expect in getting plain RDF in. Wondering how the community might view the DataONE integration.
  • Discuss content extracted and published in RDF. ==> Update: created a PowerPoint presentation showing what data was extracted. Updated a use case for week 3.
  • Create project demo. ==> Update: created a PowerPoint presentation describing the changes to the RDF to support browsing.
  • Complete KNB & ORNL DAAC data extraction to RDF ==> Update: could not address this.
  • Identify project next steps from community responses. ==> Update: no responses yet. Need to focus on the rest of the DataONE repositories and search for URLs that support RDF links.
  • Update LOD4DataONE DataONE notebook ==> Update: done
  • Email interns & mentors ==> Update: done

Development opportunities for the intern: Understand the RDF that is and should be available for scientific data.
Expected Outcomes: Identify next implementation steps for data reconciliation.
Complete?: Mostly ==> Update: feedback not as interactive as expected. Still many questions to answer, e.g., how to link data, what RDF vocabularies to use, and how to leverage RDF browsers.

Week 4 (Jun 27th – Jul 1st)

Project Activities: Focus on reconciliation of data with authoritative sources.

  • Extract KNB & ORNL-DAAC data to RDF. ==> Update: can extract KNB. The real challenge is the appropriate vocabulary for better integration.
  • Identify content to be reconciled and the ‘best’ source, getting RDF from related links. ==> Update: found ontologies to represent different aspects, most not browser compatible at this point; e.g., Zitgist could view FOAF data, and Tabulator could see that the RDF loaded location information. Neither could view graphic data from the RDF.
  • Rebuild RDF. ==> Update: I was able to do this. Found that new vocabulary would easily break browsers.
  • Understand how DataONE fits in linked data clouds – producing and using data – add blog post. ==> Update: must select not only a good representation but also one that is browser compatible.
  • Show browsing of all three data repositories and some external links. ==> Update: browsing is still sensitive to minimal changes. I need to find some level of stability to show this.
  • Update LOD4DataONE DataONE notebook. ==> Update: done
  • Email interns & mentors. ==> Update: done

Development opportunities for the intern: Learning how to integrate RDF datasets to common authoritative sources.
Expected Outcomes: Data linked with outside sources. ==> Update: able to link, but data not always visible in useful views.
Complete?: Partially. ==> Update: extracting data and linking to internal and external sources, e.g., FOAF or hasPart-type records, but viewers are not showing it. Can run RDF queries, but that is not enough; I need to understand the RDF browsers that show maps, calendars and timelines to make the point of the usefulness of the structured data.

Week 5 (Jul 4th – Jul 8th)

Project Activities: Integrating DataONE data with browser knowledge

  • Find map and time ontologies that both work with at least one RDF browser, preferably ODE. ==> Update: was able to get GEO and date to work with Tabulator and Zitgist.
  • Pull out DataONE time and location data to make it browsable within a browser. ==> Update: KNB data has latitude and longitude; I was able to use that to plot points. Dryad data has location data but it would need to be converted.
  • Provide demo about use case 1 ==> Update: not complete
  • Update LOD4DataONE DataONE notebook. ==> Update: done
  • Email interns & mentors. ==> Update: done

Development opportunities for the intern: Learning how to integrate RDF data with other/source data and how to identify and choose valid/useful data sources and vocabularies.
Expected Outcomes: Data linked to other sources of data.
Complete?: Partially ==> Update: still need to complete the demo.
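The Week 5 step that made Tabulator plot points was attaching WGS84 latitude/longitude properties (the W3C `wgs84_pos` vocabulary) to each dataset. As a minimal sketch, assuming illustrative dataset URIs and coordinates, the N-Triples look like this:

```python
# Sketch of the Week 5 geo step: assert WGS84 lat/long properties
# (http://www.w3.org/2003/01/geo/wgs84_pos#) on a dataset URI so that
# map-aware RDF browsers like Tabulator can plot it. A minimal N-Triples
# emitter; the subject URI and coordinates below are illustrative.

GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"

def location_triples(subject, lat, lon):
    """Emit N-Triples asserting a subject's WGS84 latitude and longitude."""
    return [
        '<%s> <%slat> "%s" .' % (subject, GEO, lat),
        '<%s> <%slong> "%s" .' % (subject, GEO, lon),
    ]

for t in location_triples("http://example.org/knb/dataset1", 31.77, -106.44):
    print(t)
```

KNB records already carry latitude/longitude, so this mapping is direct; Dryad's location data would first need converting into decimal coordinates.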

Week 6 (Jul 11th – Jul 15th) – Midterm evaluations.

Project Activities: Focus on Use Case 2 – How to search for DataONE data from other data.

  • Complete demo/presentation. ==> Update: done.
  • Extract ORNL DAAC data ==> Update: grabbing some location data.
  • Identify axes of integration – LOD members and data that relates. ==> Update: location and another attribute like scientific name, author location or date.
  • Identify context to unify data ==> Update: Tabulator map at first. Hierarchy if possible
  • Update LOD4DataONE DataONE notebook. ==> Update: done.
  • Email interns & mentors. ==> Update: done.

Development opportunities for the intern: Learning how to integrate RDF datasets on a bigger cloud.
Expected Outcomes: Identify next implementation steps for cloud integration.
Complete?: in progress

Week 7 (Jul 25th – Jul 29th)

Project Activities: Focus on the bigger cloud.

  • Extract ORNL-DAAC data. ==> Update: Line and Giri have sent additional comments on this. Will try Giri’s examples.
  • Identify how to link external data to DataONE data. ==> Update: will use DBpedia as the source of data.
  • Rebuild RDF. ==> Update: in progress; add some knowledge from other areas, focus on loose data for finer search.
  • Make infrastructure changes. ==> Update: focus more on information scientists – a SPARQL endpoint is being implemented.
  • Update LOD4DataONE DataONE notebook. ==> Update: done
  • Email interns & mentors. ==> Update: done

Development opportunities for the intern: Learning how to integrate RDF datasets on a bigger cloud and identifying useful cloud features for integrating RDF data.
Expected Outcomes: DataONE data accessible from a bigger cloud, e.g., Linked Open Data Cloud.
Complete?: in progress ==> Update: will focus on queries for information scientists and queries across DataONE datasets. ORNL-DAAC input did not show much promise for additional data. Will try the ideas sent, but the data I am grabbing seems good for demonstration and search purposes.

Week 8 (Aug 1st – Aug 5th)

Project Activities: Focus on the bigger cloud.

  • Regenerate DataONE RDF. ==> Update: in progress, file almost complete.
  • Build a SPARQL query tool with DataONE data. ==> Update: done, http://manaus.cs.utep.edu/ARCquery
  • Demonstrate queries. ==> Update: almost complete; need the new RDF to load correctly.
  • Update LOD4DataONE DataONE notebook. ==> Update: done
  • Email interns & mentors. ==> Update: done

Development opportunities for the intern: Building a SPARQL query interface and queries into DataONE.
Expected Outcomes: Retrieve RDF about DataONE data.
Complete?: Almost ==> cleaning up some RDF issues for all the data.
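The Week 8 query tool issues SPARQL SELECT queries over the generated DataONE RDF. As an illustration, assuming the (made-up) endpoint URL and the Dublin Core and WGS84 properties used earlier, a typical query and the SPARQL-protocol GET URL for it can be built like this:

```python
# Sketch of the kind of query the Week 8 SPARQL tool runs: find datasets
# that have a title plus coordinates. The endpoint URL below is illustrative;
# we only build the SPARQL-protocol GET URL here rather than hit the network.
from urllib.parse import urlencode

QUERY = """
PREFIX dc:  <http://purl.org/dc/elements/1.1/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?ds ?title ?lat ?long WHERE {
  ?ds dc:title ?title ;
      geo:lat ?lat ;
      geo:long ?long .
}
"""

def endpoint_url(endpoint, query):
    """Build a SPARQL-protocol GET URL for the given endpoint and query."""
    return endpoint + "?" + urlencode({"query": query})

url = endpoint_url("http://example.org/sparql", QUERY)  # illustrative endpoint
print(url)
```

Sending this URL to a live endpoint would return the matching dataset bindings (e.g., as SPARQL XML or JSON results).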

Week 9 (Aug 8th – Aug 12th)

Project Activities: Access DataONE data from an RDF Mashup w/DBpedia or Data.gov data

  • Pull DBpedia or data.gov RDF and DataONE RDF into a SPARQL tool. ==> Update: done. Grab some minimal data about species plus latitude and longitude.
  • Demonstrate the hierarchical query – general DBpedia/data.gov to DataONE record to Dryad/KNB/DAAC record. ==> Update: working on this; the website has the data, now I need to show the data relationships.
  • Discuss closing of research effort. ==> Update: to do
  • Identify final steps. ==> Update: sent high level list to Hilmar
  • Update LOD4DataONE DataONE notebook. ==> Update: done
  • Email interns & mentors. ==> Update: done

Development opportunities for the intern: Alignment of research with DataONE and LOD community.
Expected Outcomes: Steps to close research.
Complete?: Mostly. Hilmar and I will be discussing the close of research, which will include the final steps. I will be finishing up the webpage with the queries.
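The DBpedia linking in the Week 9 mashup can be sketched as follows. This is a deliberately naive illustration: the URI-guessing convention and the example record URI are assumptions, and real reconciliation would look the name up against DBpedia rather than construct the URI directly.

```python
# Week 9 linking sketch: connect a DataONE record to DBpedia by turning a
# scientific name into a candidate DBpedia resource URI and asserting an
# rdfs:seeAlso link. The naming convention is a naive assumption; a real
# reconciliation step would look the name up rather than guess the URI.

DBPEDIA = "http://dbpedia.org/resource/"

def dbpedia_candidate(scientific_name):
    """Guess a DBpedia resource URI from a binomial name (assumption)."""
    return DBPEDIA + scientific_name.strip().replace(" ", "_")

def seealso_triple(record_uri, scientific_name):
    """Emit an rdfs:seeAlso triple linking the record to the DBpedia page."""
    return '<%s> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <%s> .' % (
        record_uri, dbpedia_candidate(scientific_name))

print(seealso_triple("http://example.org/knb/dataset1", "Ursus arctos"))
```

Once such links are in the RDF, the hierarchical query above (DBpedia resource, to DataONE record, to Dryad/KNB/DAAC record) becomes a simple graph traversal.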

Week 10 (Aug 15th – Aug 19th)

Project Activities: Documentation and project completion.

  • Implement steps to close research. ==> Update: done: finished the query page, updated code and RDF, published all to GitHub.
  • Complete use case demonstration and query page. ==> Update: done.
  • Complete code; submit final version to GitHub. ==> Update: done
  • Update LOD4DataONE DataONE notebook. ==> Update: done
  • Email interns & mentors. ==> Update: done
Development opportunities for the intern: Clearer understanding of the results of the project; the intern will have documentation of the internship work.
Expected Outcomes: Final: Demo, Documentation & Code
Complete?: Yes. Will collect lessons learned and present at the final meeting and in a possible publication.