In an effort to understand the 3 repositories for this research, I began to collect some notes. In particular, they are based on questions I asked as I started, which were formulated from the overall goals of this project. Feel free to provide some input. I can break any of these out into a blog discussion if you would like to develop them further; request this as a post here and I will start it on the blog page.

I am updating these as the weeks progress; this research requires that I go over many of these questions iteratively to iron out the details. Apologies if these notes seem incomplete: they come from the notes I take as I use tools, search through DataONE repositories, and search the Web for related references and technologies. They may also need severe editing, which I will do as I go along. Originally I wrote these in Word, where I could attach footnotes and cite references throughout, but those did not carry over to this page. References are below.

These notes are being used to help me

  1. understand the repositories and how they might be accessed separately to produce similar data
  2. understand what my options are for accessing the data selected for this research and for generating RDF and external LOD links
  3. further understand the use-cases that emerge for this project

Dryad Metadata Application Profile
Email about this sent by Jane Greenberg –
In general, it seems to me that the DataONE community is working through the same questions. More importantly, to me, the specific vocabulary matters less than whether the selected vocabulary or vocabularies describe the needed data correctly and can be leveraged to search, understand, and reuse data.
Jane sent me a link about version 3 of the Metadata Application Profile, with some background that exhibited this same idea. From what I have found, leveraging the RDF data (e.g., search, view) will occur through tools that can use common terms like DC Terms (date, author, title) or WGS84 for location.
The Dryad Metadata Application Profile is based on the Dublin Core Metadata Initiative Abstract Model (DCAM) and is intended to conform to the Dublin Core Singapore Framework.

Metadata schemes/vocabularies:

Most of the dcterms fields are what I got from the OAI-PMH API, but I did not get bibo, dryad, or dwc data. DWC includes the scientific name, so I would have liked to have that one.

The DataONE API implementation does give me access to some of the fields that are missing in the dc results for the OAI-PMH services.
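A minimal sketch of how I pull the Dublin Core fields out of an OAI-PMH response. The parsing is the point here; the sample XML is a trimmed, invented record, and a real call would fetch the GetRecord response over HTTP first:

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

def dc_fields(oai_xml):
    """Collect the Dublin Core elements present in an OAI-PMH GetRecord response."""
    root = ET.fromstring(oai_xml)
    fields = {}
    for el in root.iter():
        if el.tag.startswith("{%s}" % DC_NS):
            name = el.tag.split("}", 1)[1]  # strip the namespace
            fields.setdefault(name, []).append(el.text)
    return fields

# A trimmed, made-up sample response, just to show the shape of the data:
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Data from: an example article</dc:title>
      <dc:subject>artiodactyls</dc:subject>
      <dc:relation>doi:10.5061/dryad.83</dc:relation>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>"""

print(sorted(dc_fields(sample)))  # ['relation', 'subject', 'title']
```

Running this against real GetRecord output shows at a glance which dc elements a record actually carries, which is how I noticed the missing bibo/dryad/dwc fields.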

Some terms I want to play around with are:
dcterms:spatial/Spatial Coverage
dcterms:temporal/Temporal Coverage
Spatial needs to be converted, though, because the strings do not align with any geo strings.
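The conversion step could look like the sketch below. The gazetteer here is a hard-coded toy dict with rough coordinates; a real pipeline would query a service like GeoNames instead (an assumption on my part):

```python
# Toy gazetteer standing in for a real lookup service such as GeoNames.
GAZETTEER = {
    "madagascar": (-18.9, 46.9),
    "borneo": (0.9, 114.5),
}

def spatial_to_wgs84(coverage):
    """Map a free-text dcterms:spatial string to (lat, long), or None if unknown."""
    return GAZETTEER.get(coverage.strip().lower())

def wgs84_triples(subject_uri, coverage):
    """Emit geo:lat / geo:long statements (N-Triples-style strings) for a record."""
    point = spatial_to_wgs84(coverage)
    if point is None:
        return []
    lat, long_ = point
    geo = "http://www.w3.org/2003/01/geo/wgs84_pos#"
    return [
        '<%s> <%slat> "%s" .' % (subject_uri, geo, lat),
        '<%s> <%slong> "%s" .' % (subject_uri, geo, long_),
    ]

print(wgs84_triples("http://example.org/dryad.83", "Madagascar"))
```

The point is that dcterms:spatial values are free text, so a lookup step has to sit between the harvested metadata and any WGS84-based RDF output.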

I do not find dcterms:hasPart in the OAI-PMH dc calls, as mentioned. I only find dc:relation, and then I have to figure out the relationship myself. In fact, I have two different methods, because dc:relation is cyclic between publication data and package data.
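To break that cycle I currently walk the relations with a heuristic of my own (this is my working convention, not anything Dryad specifies): start from the package record and never follow a relation back to something already visited.

```python
def reachable_parts(record_id, relations, seen=None):
    """Walk dc:relation links from a package record without looping back.

    relations: dict mapping record id -> list of related record ids.
    Returns the set of part ids reachable from record_id. Skipping
    already-seen ids is what breaks the package <-> publication cycle.
    """
    if seen is None:
        seen = set()
    seen.add(record_id)
    parts = set()
    for rel in relations.get(record_id, []):
        if rel in seen:
            continue  # the back-link from part to package: skip it
        parts.add(rel)
        parts |= reachable_parts(rel, relations, seen)
    return parts

# Toy example: dryad.82 is the package, dryad.83 its part; the part links back.
relations = {
    "dryad.82": ["dryad.83"],
    "dryad.83": ["dryad.82"],
}
print(reachable_parts("dryad.82", relations))  # {'dryad.83'}
```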

The overall approach seems quite normal in that Dryad plans to use multiple vocabularies and ontologies. This is the only way to really express knowledge so that it holds relationships with other things.

Hilmar sent me information on this. HIVE (Helping Interdisciplinary Vocabulary Engineering) is a model for dynamically integrating multiple controlled vocabularies.

HIVE seems like it might play an important role in content negotiation, where requests are made about data and specific vocabularies are requested or entered for searching.

ALA Conference
Jane Greenberg sent a summary of two conferences she attended. An interesting comment with respect to Library Linked Data:
In particular, people expressed frustration with the lack of applications for adequately displaying linked data, the labor-intensive cost of creating LLD, and registry shortcomings.

These fall in line with what I have found while learning to use RDF browsers with different levels of RDFization of Dryad and KNB data.
Email sent by Jane Greenberg.
The launch will support microdata; the focus has normally been on microdata, microformats, and RDFa.
It provides a collection of schemas, i.e., HTML tags, that webmasters can use to mark up their pages in ways recognized by major search providers.

Email sent from Todd Vision.
Kasabi, by Talis, makes RDF data available. It hosts data as dereferenceable resources and provides a SPARQL endpoint. Data can be accessed through an API; the idea is to augment RDF with data from a dataset. Augmented reality (AR) is a view of a physical, real-world environment whose input is augmented with computer-generated sensory input. The world goes from a 'dumbed-down' computer representation to an augmented 'real-time' semantic context with environmental elements, e.g., adding AR technology like computer vision or object recognition to make the experience more real-world.

For LOD, this would allow for a better context given a certain vocabulary.

Authors are able to access their data as linked open data. … Did they create the links? They are not totally clear, but I am assuming so.

Library Linked Data Incubator Group
Wiki reference sent by Jane Greenberg and Todd Vision.
Mission (from the wiki): to help increase global interoperability of library data on the Web by bringing together people involved in Semantic Web activities.

It mentions that there is no longer a need to work with library-specific data formats such as MARC. I think this is misleading – vocabularies are a format. Access to data will depend on RDFizing, which in turn requires choosing RDF types. Granted, there could be multiple mappings or loose structures, but these are still formats and types.

Metadata to consider: Dublin Core (creator, date), FRBR (works and manifestations), MARC21 (bibliographic records and authorities), FOAF and ORG (people and organizations).
Value vocabularies (similar to the metadata structures to consider, but may not have an RDF definition): LCSH (books), Art and Architecture Thesaurus, VIAF (authorities), GeoNames (geographical locations).

List several use cases. Vocabularies mentioned:

SKOS, FOAF, BIBO, DC, DCTerms, FRBR (RDA), CiTO, RDFa, owlt, rdfs, isbd, rdvocab, Lexvo, Geonames, LCSH, RAMEAU, Linked Data Service of the DNB, Instituto Geografico Nacional (Spain), EDM, DBpedia, BIO, Music Ontology, Organizational Ontology, OWL, UMBEL, new Civil War vocabulary, MADS in RDF, Book, vocabs from Library of Congress, OAI-ORE, DOAP, PRONOM, CIDOC-CRM, ULAM, TGN, DDC, UDC, Iconclass, DC CDType, DC Accrual Method, DC Frequency, DC Accrual Policy, PRISM, vcard, hcard, geo, W3C Media ontology, SURF ORE, SURF objectmodel, OPM, EXIF, rdaGr2, p20vocab, event ontology, Darwin Core (DWC), Statistical Core (SCOVO), Data Cube, Citation Typing Ontology, Facebook Open Graph, Google snippets, Yahoo SearchMonkey

most common: SKOS, FOAF, BIBO, DC, DCTerms

Oxford Dryad Group
Ryan Scherle sent a link about David Shotton's group at Oxford, which has been working on an RDF mapping for Dryad metadata. The link has a file of RDF created from Dryad records. They are using a few vocabularies in addition to Dublin Core, e.g., FaBiO, FRBR, and PRISM. It seems like fewer vocabularies would still be useful. I am not sure why nothing is expressed as types. Some of the links are not reachable.
The DataCite ontology seems most useful because it identifies primary and alternate identifiers. The ontology can be found here.
Link sent by Hilmar. They discuss intentions to support URI dereferencing and content negotiation for DOI entities, but this is not done yet – from what I can find.
Sent by Todd Vision: a discussion effort to consider OAI-PMH and its integration with the Semantic Web. My question: what should access to a file look like, e.g., the download of a file?
This group had a meeting June 2–3, 2011 discussing an integration with LOD. Similar issues to consider: what vocabulary, and what should be returned. It mentions the way libraries manage concepts, but RDF allows for expressing real things; there is a need to describe both concepts and actual data. This sounds like they are referring to the need to capture metadata in some predefined vocabulary as well as specific metadata from the data.
Two vocabularies mentioned: MARC and FRBR; no real agreement on vocabulary.
Need to model data and use a vocabulary to connect things.
Need to add relationships.
A paper on the OAI2LOD server, which handles:

  • content negotiation
  • URI dereferencing
  • SPARQL access to metadata
  • XHTML and RDF serialization formats for content negotiation

They are reviewing data for links to support more linked data and linked data browsing
Defining classes as needed
#LODLAM on twitter
Ryan Scherle sent an email about notes added to the Dryad wiki on the Dryad member node implementation of the DataONE API. Matt Jones had mentioned the API to me before, but no implementation details were available. I did read a DataONE API doc that explained member nodes and coordinating nodes.

Calls to member nodes return a structure similar to the OAI-PMH API calls that get Dublin Core-based XML from the corresponding DataONE member nodes. It is actually quite simple; I could easily interchange it with the structure I have using OAI-PMH.
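The interchange could be a thin adapter like the sketch below. The key names are my own working convention (assumptions), standing in for whatever each service's parsed XML actually produces:

```python
def normalize(source, raw):
    """Reduce a parsed record from either API to one common shape.

    'raw' is a dict already extracted from the service's XML; the key
    names below are placeholders of mine, not part of either API.
    """
    if source == "oai-pmh":
        return {"id": raw.get("header_identifier"),
                "title": raw.get("dc_title"),
                "relations": raw.get("dc_relation", [])}
    if source == "dataone":
        return {"id": raw.get("identifier"),
                "title": raw.get("title"),
                "relations": raw.get("documents", [])}
    raise ValueError("unknown source: %r" % source)

rec = normalize("oai-pmh", {"header_identifier": "oai:dryad:82",
                            "dc_title": "Data from: ...",
                            "dc_relation": ["dryad.83"]})
print(rec["id"])  # oai:dryad:82
```

With this in place, the downstream RDF generation would not care which of the two APIs a record came from.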

This seems to be available for Dryad data only; I do not believe that the ORNL DAAC or KNB have created DataONE-compliant servers.
Although the interface is nice, I don't understand why there is now an additional interface: why was the DataONE API necessary to replace the OAI-PMH API? OAI-PMH has a large server base, while the DataONE API does not. They seem to return similar things, and the OAI-PMH data structure seems to be open as well.

DataONE Inaugural Meeting Notes
Discusses how to build a group of data member nodes, focused primarily on collecting and publishing data and metadata, and a group of coordinating nodes, focused primarily on facilitating searches and providing redundancy across member nodes.
Member nodes are to find mappings to a defined DataONE structure. I wonder if this would be better served via a SPARQL query returning RDF results, allowing member nodes to handle returning data in an RDF vocabulary defined by DataONE.
For the data, there is no single data structure imposed.

D1 identifies 4 challenges to address:

  • data loss
  • data dispersion
  • data deluge
  • poor practice

Its structure depends on three entities: member nodes, coordinating nodes, and the investigator toolkit.

    Member nodes: primarily earth science data; a unique identifier for each data item; primarily data content; internal data structures mapped to the D1 structure; data is small; content is in widely used data formats.
    Data modeled: data + scientific metadata that describes properties of the data + system-generated metadata. Science data is an opaque set of bytes stored on member nodes.
    Coordinating nodes: hold a copy of the science metadata, parsed to extract attributes that assist the discovery process; focus on discovery.
    Investigator toolkit: tools that help with understanding the data – context.
    DataONE provides the background for networks to interoperate.

    There are diverse needs of stakeholders:

  • kind of collection: species, temporal
  • access to raw data: geographic information about rare species

    For LOD, it seems that both member nodes and coordinating nodes will need to support dereferencing of nodes to a namespace (member nodes) and content negotiation. I don't see these details changing a lot. Redundancy of data would help with Semantic Web issues, as long as reliability of data and quality of the network are not negatively affected.
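The content negotiation part could be as small as the sketch below: inspect the HTTP Accept header and route to an RDF or HTML representation. The media types are standard; the two-format split is my simplification of what a node would need.

```python
def negotiate(accept_header):
    """Pick a response format from an HTTP Accept header.

    Returns 'rdf' for RDF/XML or Turtle requests, 'html' otherwise –
    the minimal behavior an LOD-aware node would need when a URI
    is dereferenced.
    """
    accepted = [part.split(";")[0].strip().lower()  # drop q-values etc.
                for part in accept_header.split(",")]
    for media in accepted:
        if media in ("application/rdf+xml", "text/turtle"):
            return "rdf"
    return "html"

print(negotiate("application/rdf+xml"))              # rdf
print(negotiate("text/html,application/xhtml+xml"))  # html
```

A real node would also send 303 redirects from the thing-URI to the chosen document, but the Accept-header routing above is the core of it.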

    The structured definition of data and metadata could integrate with tools (translation-e.g., format migration, extraction-e.g., rendering, merging-e.g., combining multiple instances)

    There are many things I did not focus on e.g., security and storage capacity.

    DataONE is the structural backbone of this member node/coordinating node/Investigator toolkit infrastructure.

    Background Questions
    Important reference – the information in this document is very useful and answers some of these questions. Sent by Todd Vision

    How did Dryad get started?: – facts about Dryad
    Dryad emerged from a NESCent workshop entitled “Digital data preservation, sharing, and discovery: Challenges for Small Science Communities in the Digital Era” in May 2007.
    Official site started in January 2009.
Dryad is a repository that archives data for publications – peer-reviewed articles – primarily biological and ecological data.
– Dryad provides data curation by examining data and metadata to ensure they are reusable; data files are migrated to archival formats. They have some control over the data format, and a curator reviews published data.
    – Link data (more like metadata) to specialized databases (e.g. GenBank and TreeBase)
    – Data gets a DOI – independent of articles, data is citable
    Works with partner journals to collect metadata and house the data. Journals house the paper.
Goals: focus on preserving data at the time of publication, lowering the burden of data sharing, and making data uniquely identifiable and searchable. Currently searches are focused on metadata, which might include data manually extracted from the datasets. Dryad does not usually accept unpublished data, i.e., data not associated with a publication.
Users submit data, a description of the publication, and a data description, and receive a DOI. Information regarding reuse, e.g., descriptions of column headings, is expected in a README file. Submissions occur via the website or through the partner journals.
    Data files collected together into a data package
    From D1 meeting: data associated with journal articles in the basic and applied sciences

    How did KNB get started?: – facts about KNB
The Knowledge Network for Biocomplexity (KNB) is a national network intended to facilitate ecological and environmental research in biocomplexity. Data is published with metadata, using EML as the metadata description language. Collaborators are ecological and environmental scientists from around the nation and world. KNB supports cross-site, interdisciplinary, synthetic research by collecting data descriptions (in EML) that allow for discovering and accessing data distributed across widely dispersed locations, e.g., housed close to the primary users but usually inaccessible to others.
    Main issues handling: data is widely dispersed, heterogenous, need for synthetic analysis tools.
The KNB architecture supports data access, with users defining their data (in EML) and placing these descriptions in a central repository (the Metacat metadata server); information management, which helps identify useful datasets for creating synthesized datasets and supports quality tests for shared data; and knowledge management, which provides tools for data exploration and visualization.
KNB is a tool (a suite of tools?) that synthesizes relevant environmental information to address important ecological relationships and advance ecological understanding.
Users log in to the KNB site to register their datasets, which are placed in the Metacat XML database. Morpho is a tool used to register data, that is, to create datasets and manage them, e.g., availability, access control, etc.
    From D1 meeting: biodiversity, ecological and environmental data from a highly distributed set of field-stations, laboratories, research sites and individual researchers.
    File types: tabular relational data, vector and matrix data, raster images, vector images and audio.
An interconnected set of Metacat data and metadata management servers – main servers at NCEAS and LTER. Access is via REST- and SOAP-style service interactions.
Metadata: Ecological Metadata Language (EML), Biological Data Profile (BDP), ISO 19139, Modeling Markup Language (MoML), and others.

    How did ORNL DAAC get started?: – facts about ORNL-DAAC
    The Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) was created to support biogeochemical dynamics. The ORNL DAAC archives data produced by NASA’s Terrestrial Ecology Program. The DAAC provides data and information relevant to biogeochemical dynamics, ecological data, and environmental processes. This data is critical for understanding the dynamics relating to the biological, geological, and chemical components of Earth’s environment.
ORNL-DAAC supports many repositories that provide data exploration and visualization for finding and configuring datasets for download, through the Mercury data search tool. The tools seem powerful for a scientist who knows what they want or has some domain knowledge. That is probably true of all the data repositories; the complexity and configurability of the tools just makes it more obvious here.
    Published best practices for collecting and sharing data, to make it more useful.
    Researchers contact DAAC directly to share data, and metadata.
    From D1 meeting: data from terrestrial ecology and biogeochemical dynamics produced by NASA’s Terrestrial Ecology Program.
    900 data sets (1TB total volume)
    Types of data: spatial and tabular data
    Data contributed by contacting ORNL DAAC and discussing curation, consistent with Open Archival Information System reference model.
Data can be acquired in many ways, e.g., the Mercury search tool and the Open Spatial Data Access tool.
    What metadata structure is used to hold Dryad Data and what type and structure of data can it describe?
NESCent and the Metadata Research Center conducted a vocabulary assessment from keywords in articles from partner journals, in an effort to identify appropriate vocabularies for representing Dryad data objects. They found that no single vocabulary was sufficient. The HIVE project focuses on keyword extraction, more specifically on dynamically integrating multiple controlled vocabularies, and there are toolkits to help with this. Current metadata collected: title, description, user-suggested keywords, relations to multiple documents, etc.
Dryad uses DataCite to register DOIs for data. DataCite specifies the metadata that will be associated with each DOI, which is used to discover data.
From D1 meeting notes: stored in Dublin Core; includes Dublin Core fields, Darwin Core, and the Bibliographic Ontology, plus some locally defined fields.
There seem to be no limits on file format. They have a recommended list from the University of Texas Libraries, recommended if you want long-term support for a file type.
    From D1 meeting notes: no restrictions on file and content format. Could be tabular, text, multimedia, etc.
    Interesting observations:
    1) Dublin Core is used to label most standard terms, and dc has a straightforward RDF mapping. Actually, this is due to the oai_dc mapping for OAI-PMH.
    2) dc:relation.external is used to point to any external relationships between the data, in general, and other knowledge – this record points to a record or query in an external source of information.
    3) Currently there does not seem to be finer-grained linking of information, e.g., a field in the dataset linking directly to a field of a record in the external data source.
    4) dc:relation links to other things as well, but these are mainly internal.
    5) All entities have a ScientificName – does this make it a good search field? What makes a good search field for finding related data? – common, not unique?

    What metadata structure is used to hold KNB Data and what type and structure of data can it describe?
Ecological Metadata Language (EML) is the metadata specification used to describe shared datasets on KNB. EML documents are shared XML document types, made up of several modules, each designed to describe one logical part of the total metadata that should be included with any ecological dataset. If users have existing metadata for their data, there are recommendations for converting it. Again, Morpho can be used to create EML-compliant metadata. Metacat is the repository that allows searching over a larger set of EML docs. EML data is modular and can be linked together as needed; standards like CSDGM are monolithic.
EML seems not to be limited to describing a specific structure of data. It references the University of Texas Libraries. Data can be included in the inline element of EML's physical module; inline binary data must be base64 encoded. The number of modules and the details can be overwhelming – at least from what I (AG) saw. Using the provided tools seems to be the realistic way to build EML metadata. I did not investigate EML further to determine whether there were further data file type restrictions.
Accessing data/metadata is not as straightforward as what I found for Dryad – at least there is not a simple description. This probably means there are more robust ways to get data, but it requires more overall knowledge.

    What metadata structure is used to hold ORNL-DAAC data?
The metadata structure is very flexible, in that researchers contact the DAAC to discuss metadata.
It uses a subset of FGDC, Dublin Core, GCMD DIF, and NASA ECHO.

    How can I automatically get data from Dryad?:
Dryad has a wiki page that describes data access for the Dryad repository. A sitemap can be used to traverse the site.
OAI-PMH is a service-based protocol that can be used to search the metadata in the Dryad repository, and Dryad also provides a METS interface for accessing the data. A sequence of commands allows for searching the metadata using the OAI-PMH interface to identify dataset ids, then using the METS interface to obtain and download data bitstreams through data URLs. These calls seem straightforward, and in fact the examples provided made it easy to find data.
One issue: the Dryad URL for a resource can show metadata, but unfortunately the terms are not always the same as what is found on the object. For example, the page shows a field called dc:relation.uri that does not exist on the data; neither does dc:relation.haspart. These differences require me to write more specific code to identify similar characteristics, where possible.
    One nice thing about this interface is that I can see the metadata online, by asking for a full metadata view. This view helps me understand what I should expect from my data initially.
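The OAI-PMH-then-METS sequence mostly comes down to building the right URLs. Both base URLs below are my reconstructions from memory of the wiki's examples, so treat them as assumptions and check them against the Dryad data-access page:

```python
from urllib.parse import urlencode

# Assumed base URLs; verify against the Dryad wiki before relying on them.
OAI_BASE = "http://datadryad.org/oai/request"
METS_BASE = "http://datadryad.org/metadata/handle"

def oai_list_url(prefix="oai_dc", set_spec=None):
    """Build the ListRecords request used to sweep the metadata."""
    params = {"verb": "ListRecords", "metadataPrefix": prefix}
    if set_spec:
        params["set"] = set_spec
    return OAI_BASE + "?" + urlencode(params)

def mets_url(handle):
    """Build the METS view URL for one item, given its handle."""
    return "%s/%s/mets.xml" % (METS_BASE, handle)

print(oai_list_url())
print(mets_url("10255/dryad.83"))
```

From the METS document one would then pull the bitstream URLs and download the actual data files.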

    How can I automatically get data from KNB?:
KNB data can be retrieved by using the Metacat API to search the EML-based repository. There are many calls that can be used to access a Metacat repository, which is accessible through the KNB Metacat server http://knb/
Basically, I would establish a public connection and then start obtaining data.
KNB also has an OAI-PMH interface. It might be useful to see how to generalize the step of obtaining metadata and then accessing and extracting information from the data itself.
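The Metacat calls I expect to use look like the sketch below. The parameter names (`action`, `docid`, `qformat`) are from the Metacat servlet API as I understand it, but treat them and the base URL as assumptions to verify:

```python
from urllib.parse import urlencode

# Assumed Metacat servlet endpoint; verify against the KNB documentation.
METACAT = "http://knb.ecoinformatics.org/knb/metacat"

def read_url(docid, qformat="xml"):
    """URL that returns the EML document (or data file) for one docid."""
    return METACAT + "?" + urlencode(
        {"action": "read", "docid": docid, "qformat": qformat})

def squery_url(pathquery_xml):
    """URL for a structured pathquery search over the EML repository."""
    return METACAT + "?" + urlencode(
        {"action": "squery", "qformat": "xml", "query": pathquery_xml})

print(read_url("connolly.116.10"))
```

Establishing the "public connection" first would just mean hitting the login action (or skipping it for public documents) before issuing reads.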

    How can I automatically get data from ORNL-DAAC?:
I found data and metadata directories but could not find how to get this information automatically. One can perform configurations and add items to a shopping cart, but there is no access to a URL.
Web pages from searches provide some redirect call, but this is not documented – I can't seem to figure it out.
They have an OAI-PMH repository. Where is it?

    Dataset Questions

    What datasets were chosen from Dryad and why?:
    Ryan Scherle was helpful in sending information on datasets. He sent these suggestions:
    * Associated article, using the article DOI and CrossRef’s LOD services.
    * File type (i *think* there is LOD on file types somewhere)
    * Taxon names — Not all records in Dryad will have them, but here’s one: doi:10.5061/dryad.487

    In addition, he pointed me to datasets of interest at:
I chose 6 datasets.
doi:10.5061/dryad.82, because it has a TreeBASE link. I might be limited in what search information I can extract from the data – I may not even find a direct link – but it makes a good test case for TreeBASE.
This dataset has part: http://handle/10255/dryad.83. I could search for artiodactyls, hunting, extinction, or other dc:subject terms. There is an Excel file, and the dataset is related to a TreeBASE record. The TreeBASE record is NOT the data, just related to it.
doi:10.5061/dryad.8437, because it has a GenBank link. This link is not as clearly visible as in the previous data (e.g., via a dc:external_relation descriptor), so it will be interesting to see if I can show this one.
This dataset has parts http:/handle/10255/dryad.8438, dryad.8439, dryad.8440, and dryad.8441. I notice it should not be used yet because it is embargoed. Does this research count, since technically I am not using the results? On GenBank: HQ540559. The GenBank record IS the data.
doi:10.5061/dryad.234, because it is the most popular. This leads me to believe that it will relate to multiple 'things' – so a good candidate. Again, the data is not necessarily accessible to me.
This dataset has part http://handle/10255/dryad.235. The data file is an Excel file, from which I could link with Family, Binomial, Density, Region, and reference#.
doi:10.5061/dryad.1252, doi:10.5061/dryad.1295, and doi:10.5061/dryad.2016, because I have more access to their data. These all have fields that I could at least relate by dates or taxonomies.
Each has parts I can use, as well as local files that I can extract data from if links to the LOD are found, e.g., latitude and longitude with ODE's Where view.
The 1252 csv file has fields clade, museum number, state/province, country, lat, long.
1295 has fields provenance and region. It is actually made up of more than one csv file; both have a README file that describes the fields. A structured file could describe the extraction fields.
The 2016 csv file has fields plot and nest – the application of these fields is not as easy to see. It also has a README.
Note: selecting data just by whether a csv is accessible makes it hard to judge the 'quality' of the cloud.
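Extraction from the 1252-style csv could look like this sketch. The column names match what I listed above for that file; the sample rows are invented, purely to exercise the code:

```python
import csv
import io

# Invented rows; only the header matches the real dryad.1252 file.
sample = io.StringIO(
    "clade,museum number,state/province,country,lat,long\n"
    "A,MVZ-1,Oaxaca,Mexico,17.1,-96.7\n"
    "B,MVZ-2,Chiapas,Mexico,16.8,-92.6\n")

def link_candidates(fileobj):
    """Pull out the fields I expect to link into the LOD cloud."""
    out = []
    for row in csv.DictReader(fileobj):
        out.append({"clade": row["clade"],
                    "country": row["country"],
                    "point": (float(row["lat"]), float(row["long"]))})
    return out

rows = link_candidates(sample)
print(rows[0]["point"])  # (17.1, -96.7)
```

The (lat, long) pairs are what would feed the ODE Where view; clade and country are the candidates for taxonomy and GeoNames links.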

    The doi:10.5061/dryad.8437 had an error with the GenBank link. Apparently the author entered the wrong GenBank code. The good thing was that the record could be fixed by a site administrator.

    What datasets were chosen from KNB and why?:
There were no specific suggestions on the data to use from this community. In addition, I found no metrics like "most popular." The biggest difficulty in finding data on KNB was that many of the data files are links to other sites, which resulted in a login screen, or a new query tool, or a new structure to traverse for the data. So the datasets I chose will at least assure that the data is accessible. This might change as I understand how to use the Metacat API and find the data for a given Metacat resource.

    KNB datasets chosen were:
connolly.116.10, because it has 4 txt files and the metadata explains the file attributes – specifically Connolly.1051.1-AppendixD-1.txt.
Second, connolly.2773.3 (PHS8495.txt), also because it is text, and because it seems to have an internal similarity to connolly.272.3 (NUT9195A.txt), the third choice.

    At the least, I can minimally show an internal relationship to the data. I hope to achieve better cloud integration once the community sees an initial integration.

    What datasets were chosen from ORNL-DAAC and why?:
    No suggestions from the community.
I searched for csv files, as suggested by Giri, using ORNL-DAAC; otherwise the data files could not be downloaded. The numbers given do not provide a way to access data in automated form.
Looking in the data repository, csv files were hard to find – I found three. They were chosen in part because they are csv, and that was hard to come by. Still, I can see integrations with GIS-type info (long, lat), date, and year. In addition, some describe the same information and come from the same data branch:

    Discuss the larger integration points (i.e., links to other common things) that make sense for Dryad data:
Looking at the LOD, I expect to link data to GenBank, TreeBASE, some dataset where I can use the ScientificName that all resources in the Dryad repository seem to have, and possibly DOI links. I would like to find a repository that has a link to some field on the data for all three datasets, but I don't see it yet. For example, much of the data might have a DOI, but that is not true in the metadata for anything other than the Dryad data.
Looking at the ODE tool, they have a Where tab; it would be nice to see where the data was stored. A DOI OAI-PMH service was just announced on dryad-dev. What if I create a small cloud and link data from that to this data? This shows the potential to link data out into a cloud.
Dryad data shown as full metadata has dc:relation, dwc:ScientificName, and dc:identifier, which can be used to link data. Unfortunately, in oai_dc format, the same fields are not available.
In searching through this data and these links, I found that there are links to data in Catalogue of Life, Encyclopedia of Life, and iPhylo. I should be able to link to DBpedia data as well.
By entering … I can get related data. There is an RDF view. I could create links to this data and show the link back – or I could use the ODE sponger to extract what it can from the EOL pages or RDF, and then find relationships between the two.
    Discuss the larger integration points (i.e., links to other common things) that make sense for KNB data:
I hope KNB integration with the LOD will come from a taxonomy. Given the difficulty in finding data, it was hard to spend time mapping data to the LOD. EML is quite large and extensive; I am sure there are integration points, but it will take longer to make use of this information because there is a lot in there.

    Discuss the larger integration points (i.e., links to other common things) that make sense for ORNL-DAAC:
Still working on this one.

    Discuss any vocabularies that make sense for the chosen Dryad data:
I will use the existing metadata as the Dryad vocabulary, in particular the fields that come from the metadata section of the OAI-PMH GetRecord verb. That is, all metadata will be converted to RDF. I will extract additional data as it links to the LOD cloud, and I will give it ad-hoc names.
Not being able to see this data in ODE leads me to believe that the vocabulary I am using, an ad-hoc RDF, is not sufficient for ODE. I will look into OWL and RDF Schema to see if either makes this better.
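The metadata-to-RDF step can be sketched as below: take the harvested field dict (the same shape the OAI-PMH parsing produces) and emit N-Triples, with everything as a plain literal for now. Typed literals and real outbound links would come later.

```python
def to_ntriples(subject, fields):
    """Turn harvested oai_dc fields into N-Triples lines.

    'fields' maps a dc element name (e.g. 'title') to a list of values.
    Every value becomes a plain literal under the dc/1.1 namespace.
    """
    dc = "http://purl.org/dc/elements/1.1/"
    lines = []
    for name, values in sorted(fields.items()):
        for value in values:
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            lines.append('<%s> <%s%s> "%s" .' % (subject, dc, name, escaped))
    return lines

triples = to_ntriples("http://example.org/dryad.83",
                      {"title": ["Data from: an example article"],
                       "subject": ["artiodactyls", "extinction"]})
print("\n".join(triples))
```

The subject URI here is a placeholder; in practice it would be the Dryad handle, per the URI discussion below in these notes.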

    Discuss any vocabularies that make sense for the chosen KNB data:
I will use the existing metadata (EML) as the KNB vocabulary; that is, all metadata will be converted to RDF. I am considering using OAI-PMH because it is a consistent way to get and map the metadata – I think. I will extract additional data as it links to the LOD cloud and give it ad-hoc names. I also hope to show and discuss the fact that this vocabulary is used by many sites, and that there are already integrations that consider vocabulary (RDF, OWL).

    Discuss any vocabularies that make sense for the chosen ORNL-DAAC data:
This will be more ad-hoc than the other two vocabularies because I have no metadata to work with. For the csv files, I can work with column names and give them an ORNL-DAAC vocabulary initially, then focus on how it integrates with the cloud.
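Turning column names into an ad-hoc vocabulary could be as simple as slugifying the headers into predicate URIs. The namespace below is a placeholder of mine:

```python
import re

# Placeholder namespace for the ad-hoc ORNL-DAAC vocabulary.
BASE = "http://example.org/ornl-daac/vocab#"

def predicate_for(column):
    """Slugify a CSV column header into an ad-hoc predicate URI."""
    slug = re.sub(r"[^a-z0-9]+", "_", column.strip().lower()).strip("_")
    return BASE + slug

print(predicate_for("Site ID"))         # ...#site_id
print(predicate_for("Longitude (deg)")) # ...#longitude_deg
```

Later, integrating with the cloud would mean mapping these ad-hoc predicates onto shared terms (e.g., WGS84 for the longitude column) instead of minting local ones.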

    Discuss URIs that should be given to the chosen Dryad data:
Initial URIs will be based on Dryad. The about statement will refer to the Dryad handle for the metadata, because the RDF is based on that entry. Note that I will supplement the RDF with additional statements from the data, so the new RDF will contain more information than the metadata found at the Dryad handle. Currently, the RDF is generated dynamically. I personally prefer that, because it allows for getting the current snapshot without having to store so much information, or the evolution of information. The downside: the Semantic Web is based on links, and if information is linked and that document changes, the linked data may be invalid. This is an argument against dynamically generated RDF.

    Discuss URIs that should be given to the chosen KNB data:
    Initial URIs will be based on KNB. Naming KNB data is trickier: access to the data is not necessarily available over the Web, so the about field is more obscure. For now, one will be made up. I foresee an issue when outside users, i.e., from the cloud, need access to KNB data; the naming then needs to be openly available.

    Discuss URIs that should be given to the chosen to ORNL-DAAC data:
    Initial URIs will be based on ORNL-DAAC. This seems to have issues similar to KNB's, although I have less information here.

    Other Notes

  • Dryad – best practices for data archiving: focused on concrete suggestions for the what, how and when of data archiving, and sensible guidelines for data reuse. The aim is to promote data archiving and responsible data reuse (e.g., author contact, co-authorship, attribution/credit) and to prevent data misuse.
  • Data archiving in ecology and evolution: best practices:
    The author mentions reasons why data archiving is good for science and for reuse by a larger community, and focuses on how to archive a dataset to make it more usable. The focus is on archiving data associated with papers; data associated with unpublished projects lacks a paper to provide the important context, methods and metadata required to interpret results.
    Best practices for data archiving:
    1. Choose the archive most suitable for your data, e.g., GenBank for DNA sequences, TreeBASE for phylogenetic trees. Personal websites tend not to persist.
    2. Consider future users; clearly communicate the data and its context. This improves the chance of reuse and avoids some future questions.
    3. Annotate the data with metadata (e.g., EML) or a README file.
    4. Start data management while analyzing the data and writing the paper.
    5. Be careful not to archive sensitive or prohibited information.
    6. Use recommended non-proprietary file formats, e.g., comma-delimited or plain-text files rather than Word or Excel.
    7. Test your data: rerun several analyses reported in the paper. Try to get an extra pair of experienced eyes to look at the archive/readme/metadata.
    8. Archive the data soon after collection, but make plans for an embargo until it is ready for public access.
    Best practices for data use, respect for work:
    1. Never reuse data without a careful reading of the original papers and associated materials. This minimizes misuse.
    2. Recreate some results from the original paper.
    3. Contact the original authors to discuss reuse of the data.
    4. Offer co-authorship when the original authors' NEW input reaches a non-trivial level. Otherwise, cite the original work in the new work.
    5. Cite the original paper and associated data, using full citations as with a research paper. “Cite others’ data as you would like your data to be cited.”
    6. Always check and recheck discrepancies or errors in the original work, then contact the original authors for clarification. If the error is substantial, you may contact the journal with a corrigendum; attempt to work this out with the original authors.
    Best practices for editors and publishers
    1. Expect data to be archived during research, not just after publication.
    2. Allow embargoes for a specified time.
    3. Invite the original authors to review publications that make extensive use of their particular data.
    AG: Questions:
    1. No mention of social data.
    2. No mention of incremental data and its alignment with overall research publications, e.g., at least a mention of the need for a plan.
    Recent adoption of mandatory data archiving: Joint Data Archiving Policy (JDAP)
    This is a policy of required deposition to be adopted by Dryad partner journals. There is a mutual dependence: the journals require deposition, and Dryad lowers the barrier to storing and curating the data. Data can be embargoed for up to a year. Partner journals include, e.g., The American Naturalist, Evolution, Evolutionary Applications, and Heredity (The Genetics Society).
  • KNB and ORNL-DAAC also have their own best practices; I did not compare them.
  • Todd mentioned the best practice of preventing misuse. This seems to be a role for publishers as well. Should data be searchable in any available way, but accessible only if you can trace the source? Does RDF help with this?

    Initial Conclusions

    Each repository has its own data and metadata. Although this research considers how to use the data itself when searching for data, the metadata already there should be leveraged to find datasets as well; it is in use now, and several useful tools depend on it. This information could then be supplemented with data-specific information. Each dataset repository has an OAI-PMH interface. As expected, each exposes the same verbs, per the OAI-PMH spec; nevertheless, each has its own intricacies, in particular the language used to represent the metadata. Furthermore, OAI-PMH, being a metadata harvesting protocol, does not provide access to the data itself. Thus, understanding the mechanisms the 3 repositories offer to access the data is necessary. As noted, these are all different per site, and not all seem easy to automate. Dryad has a documented process that is easy to replicate. KNB has a single location to get data and its related metadata from: the Metacat server. Currently, the only automated data or metadata access I have achieved for ORNL-DAAC is via a traversal of its ftp and metadata directory structures.
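    Since all three repositories speak OAI-PMH, the harvesting side can at least share code. The sketch below builds ListRecords requests and extracts the resumptionToken that drives paging; per the OAI-PMH spec, a resumed request carries only the verb and the token. The base URL and the response fragment are placeholders.

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def listrecords_url(base_url, metadata_prefix="oai_dc", token=None):
    # Per the spec, a resumed request carries only the verb and the
    # resumptionToken, never the original arguments.
    params = {"verb": "ListRecords"}
    if token:
        params["resumptionToken"] = token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urllib.parse.urlencode(params)

def next_token(response_xml):
    """Return the resumptionToken of a ListRecords response, or None."""
    root = ET.fromstring(response_xml)
    elem = root.find(".//%sresumptionToken" % OAI_NS)
    if elem is not None and elem.text and elem.text.strip():
        return elem.text.strip()
    return None

# Illustrative response fragment; the base URL below is a placeholder.
sample_response = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>0001/oai_dc</resumptionToken>
  </ListRecords>
</OAI-PMH>"""
print(listrecords_url("http://repo.example.org/oai"))
print(listrecords_url("http://repo.example.org/oai", token=next_token(sample_response)))
```

    The per-repository intricacies would then live in the metadata mapping step, not in this harvesting loop.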
    There is no single RDF structure, e.g., data definition language, that could encompass the data provided at each of these sites. As a result, we would run into issues similar to those with OAI-PMH, where we would need to support many metadata structures. Understanding how to push the current repository metadata and data information to RDF could help answer the following questions:
    What is the minimal amount of work needed to expose data and metadata about a dataset to the cloud?
    How useful is this to help find specific data in an RDF browser?
    How useful is this when specific knowledge is needed, e.g., semantic meaning as in X = Y or X isA Y?
    — Are there other questions we should expect to answer?

    References

  • About Dryad, accessed 1/28/2011
  • Dryad Fact Sheet, accessed 1/28/2011
  • The Knowledge Network for Biocomplexity
  • Ecological Metadata Language (EML)
  • Morpho Data Management Software
  • ORNL DAAC
  • Mercury Metadata Search System
  • Metadata Research Center, HIVE, Research
  • Recommended File Format
  • Data Access, Download, Javadoc
  • ORNL-DAAC ftp data repository
  • Best practices for data archiving, Todd Vision, accessed 1/26/2011
  • Michael C. Whitlock (2011) Data archiving in ecology and evolution: best practices. Trends in Ecology & Evolution 26(2): 61–65. doi:10.1016/j.tree.2010.11.006
  • Wren, J.D. (2008) URL decay in MEDLINE – a 4-year follow-up study. Bioinformatics 24, 1381–1385
  • Joint Data Archiving Policy (JDAP), accessed 11/16/2010
  • The Open Archives Initiative Protocol for Metadata Harvesting
