Integrating loosely structured data into the Linked Open Data cloud
The Linked Data conventions describe four principles that allow data of any kind and from any online source to form a global interconnected web of data. These four principles are: i) name every “thing” that has some data or information associated with it; ii) use HTTP URIs to do so; iii) provide useful information or data in Resource Description Framework (RDF) format to anyone looking up such a URI; and iv) within the information provided this way, link to other common “things”, such as points or axes of reference, and use common vocabularies to attach meaning to links wherever possible. Simple as they seem, these principles have been highly effective in facilitating the creation of large, globally distributed, and constantly growing aggregations of Linked Open Data (LOD). In this way, Linked Data provides a universally applicable framework for machines and users alike to integrate, navigate, and discover data by following links whose semantics are of interest.
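As a minimal sketch of what the four principles amount to in practice, the following Python snippet names a data set with an HTTP URI and describes it with two RDF statements serialized as N-Triples. The example.org URI and the title are hypothetical placeholders rather than real DataONE identifiers; the Dublin Core terms vocabulary and the DBpedia resource stand in for a common vocabulary and a common point of reference.

```python
# Hypothetical identifiers: the example.org URI is a placeholder, not a real DataONE name.
DATASET = "http://example.org/dataset/42"         # principles i and ii: an HTTP URI names the thing
DCTERMS = "http://purl.org/dc/terms/"             # principle iv: a common vocabulary (Dublin Core terms)
PLACE = "http://dbpedia.org/resource/Costa_Rica"  # principle iv: a shared external point of reference


def uri(value: str) -> str:
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def triple(subject: str, predicate: str, obj: str) -> str:
    """Format one subject-predicate-object statement as an N-Triples line."""
    return f"{subject} {predicate} {obj} ."


# Principle iii: this is the kind of RDF a lookup of DATASET could return.
triples = [
    triple(uri(DATASET), uri(DCTERMS + "title"), literal("Leaf trait measurements")),
    triple(uri(DATASET), uri(DCTERMS + "spatial"), uri(PLACE)),
]

print("\n".join(triples))
```

The second statement is the one that makes the data "linked": its object is a URI maintained elsewhere, so a client can follow it to discover further information about the place.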
However, applying the Linked Data principles to the data holdings of non-specialized digital repositories, such as DataONE and many of its member nodes, is challenging. These data are often highly heterogeneous, and are not natively expressed in RDF or in a format structured enough to lend itself to automatic conversion to RDF. Instead, they are typically represented in formats that are either loosely structured in an ad hoc manner (such as spreadsheets) or conform to one of a myriad of formats output by instruments or analysis programs. It is thus not clear what the universe of “things” to name is, what the common points or axes of reference are, what kinds (semantics) of links are needed, and how data archived in this way can be exposed in RDF such that the conversion can be automated yet remains useful for science-motivated discovery and integration.
Description of Work
The idea of this project is to develop an exploratory prototype, and practical recommendations resulting from it, for how the heterogeneous and loosely structured data held in non-specialized DataONE member nodes can be exposed to the Linked (Open) Data cloud. The approach would consist of obtaining a sufficiently representative sample of data sets from DataONE’s three initial member nodes (Dryad, KNB, and ORNL-DAAC), and using them as instance data for which to define the RDF predicate vocabularies, domain ontologies, resource URIs, and conversion mechanisms that are necessary to create a LOD representation of these data. This representation can then be loaded into, navigated, and queried in one of the web-based LOD browsers (such as URIburner) or, for example, in a local installation of OpenLink Virtuoso.
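To make the notion of a conversion mechanism concrete, here is a deliberately naive sketch, using only the Python standard library, that turns a spreadsheet exported as CSV into N-Triples. Everything named in it is an assumption made for illustration: the example.org base URI, the one-resource-per-row scheme, and the mapping of each column header to a predicate in a hypothetical vocabulary. It is not the project's method; deciding what the real resource URIs and predicate vocabularies should be is exactly the open question the project would address.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical base URI, not a real DataONE namespace


def escape_literal(text: str) -> str:
    """Escape the characters that N-Triples does not allow raw inside a literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(csv_text: str, dataset_id: str) -> list[str]:
    """Emit one N-Triples line per non-empty cell.

    Assumed scheme: each row becomes a resource, and each column header
    becomes a predicate in a hypothetical vocabulary under BASE.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for index, row in enumerate(reader, start=1):
        subject = f"<{BASE}dataset/{quote(dataset_id)}/row/{index}>"
        for column, value in row.items():
            # Skip overflow cells (no header), missing cells, and blank cells.
            if column is None or value is None or not value.strip():
                continue
            predicate = f"<{BASE}vocab/{quote(column.strip())}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value.strip())}" .')
    return lines


sample = "species,height_cm\nQuercus alba,1250\nAcer rubrum,\n"
for line in rows_to_ntriples(sample, "demo"):
    print(line)
```

The output of such a mechanical conversion is valid RDF but carries little meaning: every value is an untyped string, and no statement links to a shared point of reference such as a taxon or a place. Closing that gap is what the vocabularies and domain ontologies described above would be for.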