This week the LOD4DataONE research formally started. During the previous two weeks I performed initial research to select tools and a development platform that should help me keep the research on schedule.
This week my focus was on selecting the specific datasets, and a related RDF structure, to use from the three DataONE repositories (KNB, Dryad, ORNL-DAAC). I selected three datasets from each repository to extract data from. The challenge was finding datasets that were useful to this research effort, i.e., data that was accessible and in a text-based or comma-delimited structure. This required understanding how the different platforms work, the types of data they contain, and the mechanisms for getting to the data, in particular in an automated fashion; each repository handles data access quite differently, so this was an initial challenge. I have defined an automated data extraction plan and started building the supporting structure in Java. In addition, I considered which RDF vocabularies to use to represent the data and devised an RDF plan for each dataset; a sketch of how an extracted data row might be converted to RDF appears below. The datasets were chosen more for their potential integration with other data, in particular from the Linked Open Data (LOD) cloud, than anything else. As a result, the RDF is ad hoc. I will be evaluating it with the dryad-dev community in the upcoming weeks. I will also be updating the data use cases as I move along and understand them better.
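To make the extraction-to-RDF step concrete, here is a minimal sketch of the kind of conversion I have in mind, assuming Apache Jena as the RDF library. The namespace, property names, column layout, and dataset identifier are all illustrative placeholders, not the vocabularies I will ultimately settle on:

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.DCTerms;

public class CsvRowToRdf {
    // Hypothetical namespace; the real vocabularies are still being decided.
    static final String EX = "http://example.org/lod4dataone/";

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("ex", EX);
        model.setNsPrefix("dcterms", DCTerms.NS);

        // One comma-delimited observation row from a hypothetical dataset.
        String[] row = "site1,2010-06-01,23.4".split(",");

        Resource obs = model.createResource(EX + "obs/" + row[0] + "-" + row[1]);
        obs.addProperty(model.createProperty(EX, "site"), row[0]);
        obs.addProperty(model.createProperty(EX, "date"), row[1]);
        obs.addProperty(model.createProperty(EX, "temperatureC"), row[2]);
        // Keep a link back to the dataset the row came from.
        obs.addProperty(DCTerms.source, model.createResource(EX + "dataset/42"));

        model.write(System.out, "TURTLE");
    }
}
```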
My observations for this first week are:
1) Understanding the goals of three repositories with such different mechanisms for accessing data exposed three entirely different worlds for automating data access and extraction. I have been documenting these differences and hope to identify how each repository's infrastructure helped or challenged my research. One important note is that I am considering how I would access this information generically across the three repositories. The Java classes I have written to interface with each repository all extend a common base class (a sketch of this layout follows this list). My theory is that the more code I can move into the base class, the more functionality is generic across the three repositories, so I will always try to add functionality to the base class before adding it to a repository-specific class. At this point, though, more still needs to go into the repository-specific classes.
2) I have learned that a lot of work has already gone into each repository to expose important, meaningful information about the data as structured metadata, mainly to make the data discoverable. This information was important and helpful in understanding possible links to the LOD cloud, and for the most part it seems to come from scientists as they publish their data. As a result, my current course for this research is to use this metadata, work out how to retrieve it from the data itself, and supplement it with additional data where I find that useful for LOD cloud integration.
3) Although the goal of this research effort is to expose internal data from datasets to make them discoverable over the LOD cloud, I learned that I need to be cautious with how this data is made available. The biggest concern is maintaining the context of the data to avoid its misuse; exposing internal data loosely, without maintaining its relevance to its original research and dataset, could convey incorrect or insufficient information. Thus, a challenge that has surfaced is: how do I expose data that is useful and necessary for linking to the LOD cloud without losing the data's context? A second sketch after this list shows one way to keep that context explicit.
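Regarding observation 1, here is a minimal sketch of the base-class idea, assuming plain java.net for retrieval. The class names, URL pattern, and method signatures are illustrative placeholders, not the actual LOD4DataONE code:

```java
import java.io.InputStream;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Shared behaviour lives in the base class; repository quirks live in
// the subclasses.
abstract class RepositoryClient {
    // Generic across all three repositories: fetch raw bytes for a URL.
    protected InputStream open(String url) throws Exception {
        return new URL(url).openStream();
    }

    // Generic entry point: subclasses only decide where the data lives
    // and how to parse the repository's metadata/data format.
    public List<String> fetchRecords(String datasetId) throws Exception {
        return parse(open(resolve(datasetId)));
    }

    protected abstract String resolve(String datasetId);
    protected abstract List<String> parse(InputStream in) throws Exception;
}

// One concrete client; Dryad and ORNL-DAAC clients would follow the
// same shape with their own resolve() and parse() implementations.
class KnbClient extends RepositoryClient {
    @Override
    protected String resolve(String datasetId) {
        // Hypothetical URL pattern for reading a KNB document.
        return "https://knb.ecoinformatics.org/metacat?action=read&docid=" + datasetId;
    }

    @Override
    protected List<String> parse(InputStream in) {
        // EML-specific parsing would go here; stubbed for the sketch.
        return Collections.emptyList();
    }
}
```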
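And for observation 3, one simple way to keep extracted values tied to their original context is to give every generated resource an explicit link back to its dataset record. A minimal sketch, again assuming Jena, with all names and identifiers hypothetical:

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.DCTerms;

public class ContextPreservingTriples {
    public static void main(String[] args) {
        String ex = "http://example.org/lod4dataone/"; // illustrative namespace
        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("ex", ex);
        m.setNsPrefix("dcterms", DCTerms.NS);

        // The dataset record carries the original research context.
        Resource dataset = m.createResource(ex + "dataset/dryad-1234")
                .addProperty(DCTerms.title, "Hypothetical Dryad dataset")
                .addProperty(DCTerms.creator, "Original researcher");

        // Every extracted value links back to that record, so a consumer
        // arriving from the LOD cloud can always recover the context.
        m.createResource(ex + "obs/1")
                .addProperty(m.createProperty(ex, "speciesName"), "Ursus arctos")
                .addProperty(DCTerms.isPartOf, dataset);

        m.write(System.out, "TURTLE");
    }
}
```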
Please provide any comments or feedback.