{"id":859,"date":"2011-06-15T15:35:57","date_gmt":"2011-06-15T21:35:57","guid":{"rendered":"http:\/\/notebooks.dataone.org\/lod4dataone\/?page_id=37"},"modified":"2013-05-15T15:36:17","modified_gmt":"2013-05-15T15:36:17","slug":"notes","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/general\/notes\/","title":{"rendered":"Notes"},"content":{"rendered":"<p>In an effort to understand the 3 repositories for this research I began to collect some notes.  In particular they are based on some questions that I asked as I started which were formulated as a result of the overall goals of this project.  Feel free to provide some input.  I can break away a conversation on any of these, as a blog discussion, if you would like to develop them further.  You can request this as a post here and I will start it on the blog page.<\/p>\n<p>I am updating these as the weeks progress; this research requires that I go over many of these questions iteratively to iron out the details of this research.  Apologies if these notes seem incomplete, they are a result of the notes I take as I use tools, search through DataONE repositories and search the Web for related references\/technologies.  They may also be in need of severe editing, I will do this as I go along.  Originally, I wrote these in Word where I could attach footnotes with references and the document had references cited throughout, but they did not carry over to this page.  References are below.<\/p>\n<p>These notes are being used to help me<\/p>\n<ol>\n<li>understand the repositories and how they might be accessed separately to produce similar data<\/li>\n<li>understand what my options are for accessing the data selected for this research and for generating RDF and external LOD links<\/li>\n<li>further understand the use-cases that emerge for this project<\/li>\n<\/ol>\n<p><strong>Dryad Metadata Application Profile<\/strong><br \/>\nEmail about this sent by Jane Greenberg &#8211;<br \/>\nIn general it seems to me that the DataONE community is working through the same questions.  In fact, to me, more importantly, the specific vocabulary is not as important as whether the selected vocabulary or vocabularies describe the needed data correctly and can be leveraged to search, understand and reuse data.<br \/>\nJane sent me this link about the Metadata Application Profile version 3 with some background that exhibited this same idea.  In what I have found, leveraging (e.g. search, view) the RDF data will occur through tools that are able to leverage common terms like DC Terms (data, author, title) or WGS84 for location.<br \/>\nThe Dryad Metadata Application Profile is based on the dublin core metadata initiative abstract model (DCAM) intended to conform to the Dublin Core Singapore Framework.  <\/p>\n<p>metadata schemes\/ vocabularies:<br \/>\nbibo  http:\/\/biblioontology.com\/<br \/>\ndcterms http:\/\/dublincore.org\/documents\/dcmi-terms<br \/>\ndryad http:\/\/datadryad.org\/metadata<br \/>\ndwc http:\/\/rs.tdwg.org\/dwc\/index.htm<\/p>\n<p>Much of the dcterms are what I got from the OAI-PMH api but I did not get bibo, dryad or dwc data.  DWC has the scientific name so I would have liked that one.<\/p>\n<p>The DataONE API implementation does give me access to some of the fields that are missing in the dc results for the OAI-PMH services.<\/p>\n<p>Some terms I want to play around with are:<br \/>\ndcterms:spatial\/Spatial Coverage<br \/>\ndcterms:temporal\/Temporal Coverage<br \/>\nSpatial needs to converted though, because the strings do not align with any geo strings.<\/p>\n<p>I do not find the dcterms:hasPart on the OAI-PMH dc calls like mentioned.  I only find dc:relations then I have to figure it out.  In fact, I have two different methods because dcterms:relations is cyclic between publication data and package data.<\/p>\n<p>The entire description seems pretty normal in the fact that Dryad plans to use multiple vocabularies and ontologies.  This is the only way to really express knowledge in a way that will hold relation with other things.<\/p>\n<p><strong>Hive<\/strong><br \/>\nHilmar sent me information on this.  HIVE, helping interdisciplinary vocabulary engineering, &#8211; a model for dynamically integrating multiple controlled vocabularies.  <\/p>\n<p>HIVE seems like it might play an important role in content negotiation, where requests are made about data and specific vocabularies are requested or entered for searching.<\/p>\n<p><strong>ALA Conference<\/strong><br \/>\nJane Greenberg sent a summary of two conferences she went to.  Interesting comment with respect to Library Linked Data:<br \/>\nIn particular people expressing frustration with the lack of applications for adequately displaying linked data, the labor intensive cost of creating LLD, and registry short-comings. <\/p>\n<p>These fall in line with what I have found in understanding how to use RDF browsers with different levels of rdfization of Dryad and KNB data.<\/p>\n<p><strong>schema.org<\/strong><br \/>\nemail sent by Jane Greenberg<br \/>\nLaunch of schema.org will support microdata.  Normally focsued on microdata, microformats and RDFa.<br \/>\nProvides a collection of schemas, i.e. html tags, that webmasters can use to markup their pages in ways recognized by major search providers.<\/p>\n<p><strong>chembliacs<\/strong><br \/>\nEmail sent from Todd Vision.<br \/>\nKasabi makes RDF data available by Talis.  Makes data available and can have data hosted as dereferencable resources.  SPARQL endpoint.  Can access data through API, idea is to augment RDF with data from a dataset.   Augmented reality (AR) &#8211; view of physical, real-world environment &#8211; input augmented with computer-generated sensory input.  World goes from a &#8216;dumbed &#8211; down&#8217; computer representation to an augmented &#8216;real-time&#8217; semantic context with environmental elements, e.g., add AR technology like computer vision or object recognition) to make the experience more real world.<\/p>\n<p>For LOD, this would allow for a better context given a certain vocabulary.<\/p>\n<p>Author able to access their data as linked open data.  &#8230; Did they create the links, they are not totally clear but I am assumming so.<\/p>\n<p><strong>Library Linked Data Incubator Group<\/strong><br \/>\nWiki reference sent by Jane Greenberg and Todd Vision.<br \/>\nMission (from wiki) help increase global interoperability of library data on the Web, by bringing together people involved in Semantic Web activities.<\/p>\n<p>Mentions that no longer a need to work with library-specific data formats such as MARC.  I think this is misleading &#8211; vocabularies are a format.  Access to data will be dependent on RDFizing which will in turn have to choose RDF types.  Granted, there could be multiple mappings or loose structures but nevertheless, these are still formats and types.<\/p>\n<p>Metadata to consider: dublin core (creator, date), FRBR (work and manifestations), MARC21 (bibliographic records and authorities, FOAF and ORG (people and organizations)<br \/>\nValue Vocabularies (similar to metadata structures to consider but may not have an RDF definition): LCSH (books), Art and Arch Thesaurus, VIAF (authorities), Geonames (geographical locations)<\/p>\n<p>List several use cases.  Vocabularies mentioned:<\/p>\n<p>SKOS, FOAF, BIBO, DC, DCTerms, FRBR (RDA), CiTO, RDFa, owlt, rdfs, isbd, rdvocab, Dewey.info, Lexvo, Geonames, LCSH, RAMEAU, Linked Data Services der DNB, Instituto Geografico Nacional (spain), EDM, DBpedia, BIO, Music Ontology, Organizational Ontology, OWL, UMBEL, new Civil War vocabulary, MADS in RDF, Book, vocabs from Library of Congress, OAI-ORE, DOAP, PRONOM, CIDOC-CRM, ULAM, TGN, DDC, UDC, Iconclass, DC CDType, DC Accrual Method, DC Frequence, DC Accrual Policy, PRISM, vcard, hcard, geo, W3C Media ontology, SURF ORE, SURF objectmodel, OPM, EXIF,  rdaGr2, p20vocab, event ontology, darwin core DWC, Statistical Core SCOVO, Data Cube, Citation Typing Ontology, Facebook opengraph, google snippets, yahoo search monkey<\/p>\n<p>most common: SKOS, FOAF, BIBO, DC, DCTerms<\/p>\n<p><strong>Oxford Dryad Group<\/strong><br \/>\nRyan Scherle sent link about David Shotton&#8217;s group at Oxford has been working on an RDF mapping for Dryad metadata.  Links has a file on the RDF that was created from dryad records.  They are using a few vocabularies in addition to Dublin Core, e.g., fabio, frbr, and prism.  Seems like less would still be useful.  Not sure why nothing is expressed as types.  Some of the links are not reachable.<br \/>\nDatacite ontology seems most useful because identifying primary and alternate identifiers.  The ontology for this can be found <a href=\"http:\/\/purl.org\/spar\/datacite\/\">here<\/a>.<br \/>\n<strong>Crossref<\/strong><br \/>\nLink sent by Hilmar.  They discuss intentions to support URI dereferencing and content negotiation for DOI entities but not done yet &#8211; from what I can find.<br \/>\n<strong>LOD-LAM<\/strong><br \/>\nSent by Todd Vision, a discussion effort to consider OAI-PMH and its integration with the semantic web.  ?my question: what should access to a file look like, e.g., the download of a file.<br \/>\nThis group had a meeting June 2,3, 2011 discussing an integration with LOD.  Similar issues to consider:  what vocabulary, what should be returned.  Mentions the way libraries manage concepts but RDF allows for expressing real things.  Need to be able to describe both concepts and actual data.  This sounds like they are referring to the need to capture metadata about in some predefined vocabulary as well as specific metadata from the data.<br \/>\nTwo vocabularies mentioned: MARC, FRBR, no real agreement on vocabulary<br \/>\nNeed to model data and use vocabulary to connect things<br \/>\nNeed to add relationships<br \/>\nPaper on OAI2LOD server that handles:<\/p>\n<ul>\n<li>content negotiation<\/li>\n<li>URI dereferencing<\/li>\n<li>SPARQL access to metadata<\/li>\n<li>XHTML and RDF serialization formats for content negotiation<\/li>\n<\/ul>\n<p>They are reviewing data for links to support more linked data and linked data browsing<br \/>\nDefining classes as needed<br \/>\n#LODLAM on twitter<br \/>\n<strong>DataONE API<\/strong><br \/>\nRyan Scherle sent an email out about notes added to the Dryad Wiki about the Dryad member node implementation of the DataONE API.  Matt Jones had mentioned the api to me before but no implementation details were available.  I did read an DataONE API doc that explained member nodes and coordinating nodes.<\/p>\n<p>Calls to member nodes return similar structure as the OAI-PMH API calls to get dublin core based xml from the corresponding DATAONE member nodes.  Actually quite simple, I could easily interchange with the structure I have using OAI-PMH.  <\/p>\n<p>Seems to be available for Dryad data only.  I do not believe that the ORNL DAAC or KNB have created DataONE compliant servers.<br \/>\nAlthough the interface is nice, I don&#8217;t think I understand why there is now an additional interface, why was DataONE API necessary to replace the OAI-PMH API.  It seems that the OAI-PMH has a large server base, DataONE API does not.  They seem to be returning similar things and the OAI-PMH data structure seems to be open as well.<\/p>\n<p><strong>DataONE Inaugural Meeting Notes<\/strong><br \/>\nDiscusses how to build a group of data member nodes, focused primarily on collecting and publishing data and metadata, and a group of coordinating nodes, focused primarily on facilitating searches and providing redundancy across member nodes.<br \/>\nMember nodes are to find mappings to a defined dataone structure.  I wonder if this is better served via a SPARQL query returning RDF results.  Allowing member nodes to handle returning data in an RDF vocabulary that is defined by DataONE.<br \/>\nFor the data, there is no single data structure imposed.<\/p>\n<p>D1 4 challenges to address:<\/p>\n<li>data loss<\/li>\n<li>data dispersion<\/li>\n<li>data deluge<\/li>\n<li>poor practice<\/li>\n<p>Structure dependent on three entities: member nodes, coordinating nodes and investigator toolkit.<\/p>\n<p>Member node data: primarily earth science, id for each data item &#8211; unique identifier<br \/>\n   primarily data content<br \/>\n   map internal ds to d1 ds<br \/>\n   data small<br \/>\n   content in widely used data formats<br \/>\nData modeled:  data + scientific metadata that describes properties of the data<br \/>\n  + system generated metadata<br \/>\nScience data is opaque sets of bytes stored on member nodes.  Coordinating Node:<br \/>\nCopy of science metadata on coordinating nodes &#8211; parsed to extract attributes to assist in discovery process.<br \/>\nHolds scientific metadata, focus on discovery<br \/>\nInvestigator toolkit: tools that help with understanding the data &#8211; context.<br \/>\nDataONE to provide background for networks to interoperate<\/p>\n<p>There are diverse needs of stakeholders:<\/p>\n<li>kind of collection: species, temporal<\/li>\n<li>access to raw data: geographic information about rare species<\/li>\n<p>For LOD, it seems that both member nodes and coordinating nodes will need to support dereferencing of nodes to a namespace (member nodes) and content negotiation.<br \/>\nI don&#8217;t see these details changing a lot.  Redundancy of data would help with issues of semantic web, as long as reliability of data and quality of the network are not negatively affected.<\/p>\n<p>The structured definition of data and metadata could integrate with tools (translation-e.g., format migration, extraction-e.g., rendering, merging-e.g., combining multiple instances)<\/p>\n<p>There are many things I did not focus on e.g., security and storage capacity.<\/p>\n<p>DataONE is the structural backbone of this member node\/coordinating node\/Investigator toolkit infrastructure.<\/p>\n<p><strong>Background Questions<\/strong><br \/>\n<strong>Important reference<\/strong> &#8211; the information in <a href=\"https:\/\/www.dropbox.com\/s\/2h4fjqwgsunxl65\/DUG_Packet_2010Dec03.pdf\">this<\/a> document is very useful and answers some of these questions.  Sent by Todd Vision<\/p>\n<p><strong>How did Dryad get started?:<\/strong> &#8211; facts about Dryad<br \/>\nDryad emerged from a NESCent workshop entitled &#8220;Digital data preservation, sharing, and discovery: Challenges for Small Science Communities in the Digital Era&#8221; in May 2007.<br \/>\nOfficial site started in January 2009.<br \/>\nDryad is a repository to archive for data for publications &#8211; peer reviewed articles; primarily for biological and ecological data.<br \/>\n&#8211;\tDryad provides data curation by examining data and metadata ensure reusable; data files migrated to archival formats.  They have some control over the data format.  A curator reviews data published.<br \/>\n&#8211;\tLink data (more like metadata) to specialized databases (e.g. GenBank and TreeBase)<br \/>\n&#8211;\tData gets a DOI \u2013 independent of articles, data is citable<br \/>\nWorks with partner journals to collect metadata and house the data.  Journals house the paper.<br \/>\nGoals : focus on preserving data at time of publication, lower burden of data sharing, make data uniquely identifiable, searchable.  Currently searches focused on metadata which might include data manually extracted from data.  Does not usually accept unpublished data, i.e., data not associated with a publication.<br \/>\nUsers submit data, description of publication, data description.  Receives a DOI.  Expect information regarding reuse, e.g., description of column headings, to be provided in README file.  Submissions occur via the datadryad.org website or through the partner journals.<br \/>\nData files collected together into a data package<br \/>\nFrom D1 meeting: data associated with journal articles in the basic and applied sciences<\/p>\n<p><strong>How did KNB get started?:<\/strong> &#8211; facts about KNB<br \/>\nThe Knowledge Network for Biocomplexity (KNB) is a national network intended to facilitate ecological and environmental research in biocomplexity.   Data is published with metadata, using EML  as the metadata description language.  Collaborators are ecological and environmental scientists from around the nation\/world.  KNB is cross-site, interdisciplinary, synthetic research focused on the collection of data descriptions (in EML) to allow for discovering and accessing data that is distributed in widely dispersed locations, e.g., housed close to the primary users, but usually inaccessible to others.<br \/>\nMain issues handling: data is widely dispersed, heterogenous, need for synthetic analysis tools.<br \/>\nSupport an architecture that provides data access that handles users defining their data (in EML) and placing these descriptions in a central repository (Metacat metadata server), information management that helps identify useful data sets to create synthesized data sets and supports quality tests for shared data, finally, the KNB architecture supports knowledge management which provides tools for data exploration and visualization.<br \/>\nKNB is a tool (a suite of tools?) that synthesize relevant environmental information to address important ecological relationships and advance ecological understanding.<br \/>\nUsers log in to the KNB site to register their datasets, which are placed in the Metacat XML database.  Morpho  is a tools used to register data, that is to create data sets and manage them, e.g., availability, access control, etc.<br \/>\nFrom D1 meeting: biodiversity, ecological and environmental data from a highly distributed set of field-stations, laboratories, research sites and individual researchers.<br \/>\nFile types: tabular relational data, vector and matrix data, raster images, vector images and audio.<br \/>\nInterconnected set of Metacat dta and metadata management servers &#8211; main servers at NCEAS and LTER.  Access via REST and SOAP style interaction of services.<br \/>\nMetadata: ecological metadata language (EML), biological data profile (BDP), ISO 19139, Modeling Markup Language (MoML) and others<\/p>\n<p><strong>How did ORNL DAAC get started?:<\/strong> &#8211; facts about ORNL-DAAC<br \/>\nThe Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) was created to support biogeochemical dynamics.   The ORNL DAAC archives data produced by NASA&#8217;s Terrestrial Ecology Program. The DAAC provides data and information relevant to biogeochemical dynamics, ecological data, and environmental processes.  This data is critical for understanding the dynamics relating to the biological, geological, and chemical components of Earth&#8217;s environment.<br \/>\nORNL-DAAC supports many repositories that provide data exploration and visualization for finding and configuring datasets for download, through the Mercury  data search tool.  The tools seem powerful for a scientist that knows what they want or has some domain knowledge.  This is probably true with all the data repositories, the complexity and configurability of the tools is what emphasizes it here.<br \/>\nPublished best practices for collecting and sharing data, to make it more useful.<br \/>\nResearchers contact DAAC directly to share data, and metadata.<br \/>\nFrom D1 meeting: data from terrestrial ecology and biogeochemical dynamics produced by NASA&#8217;s Terrestrial Ecology Program.<br \/>\n900 data sets (1TB total volume)<br \/>\nTypes of data: spatial and tabular data<br \/>\nData contributed by contacting ORNL DAAC and discussing curation, consistent with Open Archival Information System reference model.<br \/>\nAcquire data in many ways &#8211; search mercury tool.  Open spatial Data access tool.<br \/>\nFTP, OAI-PMH<br \/>\n<strong>What metadata structure is used to hold Dryad Data and what type and structure of data can it describe?<\/strong><br \/>\nNESCent and the Meta Research Center  conducted a vocabulary assessment from keywords in articles from partner journals, in an effort to identify appropriate vocabularies for representing Dryad data objects.  Found that no single vocabulary was sufficient.  Hive project focused on keyword extraction, more specifically, dynamically integrating multiple controlled vocabularies.   There are tool kits to help with this   Current meta-data collected: title, description, user suggested keywords, relations to multiple documents, etc.<br \/>\nDryad uses DataCite to register DOIs for data.  DataCite, specify metadata that will be associated with each DOI, used to discover data.<br \/>\nFrom D1 meeting notes: stored in Dublicn Core, includes dublin-core fields, darwin core and bibliographic ontology.  Some locally defined fields.<br \/>\nSeems to be no limits to file format.  Have a recommended list  from University of Texas Libraries, recommended if want long term support for file type.<br \/>\nFrom D1 meeting notes: no restrictions on file and content format.  Could be tabular, text, multimedia, etc.<br \/>\nInteresting observations:<br \/>\n<strong>1)<\/strong>Dublin-core used to label most standard terms, dc has a straight-forward RDF mapping.  Actually, this is due to the oai_dc mapping for OAI-PMH.<br \/>\n<strong>2)<\/strong>dc:relation.external used to point to any external relationships between data, in general, and other knowledge &#8211; this record points to a record or query in an external source of information.  <strong>3)<\/strong>Currently does not seem to provide finer grained linking of information .  e.g., field in the dataset links directly to a field of a record in the external data source.<br \/>\n<strong>4)<\/strong>dc:relation \u2013 links to other things as well, but these are mainly internal<br \/>\n<strong>5)<\/strong>All entities have a ScientificName \u2013 does this make it a good search field?  What makes a good search field for finding related data? \u2013 common, not unique?<\/p>\n<p><strong>What metadata structure is used to hold KNB Data and what type and structure of data can it describe?<\/strong><br \/>\nEcological Metadata Language (EML) is the metadata specification used to describe shared datasets on KNB.  EML are shared XML document types, made up of several modules, each designed to describe a logical part of the total metadata that should be included with any ecological dataset.  If users have metadata for their data, there are recommendations for converting it.  Again, Morpho can be used to create EML compliant metadata.  Metacat is the repository that allows for searching a larger set of EML docs.  \u2013 EML data is modular and can be linked together as needed, standards like CSDGM monolithic.<br \/>\nSeems to not be limited to describing a specific structure of data.  References the University of Texas Libraries.  Data can be included in EML\u2019s physical module\u2019s inline element.  Inline binary data must be base64 encoded.  The number of modules and the details can be overwhelming \u2026 at least from what I (AG) saw.  Using the provided tools seems to be the realistic way to build EML metadata.  Did not investigate EML further to determine if there were further data file type restrictions.<br \/>\nAccessing data\/metadata not as straightforward as what I found for Dryad \u2026 at least there is not a simple description.  Probably means there are more robust ways to get data but requires more overall knowledge.<\/p>\n<p><strong>What metadata structure is used to hold ORNL-DAAC data?<\/strong><br \/>\nMetadata structure very flexible in that researchers contact DAAC to discuss metadata.<br \/>\nSubset of FGDC, Dublin core, gcmd dif and nasa echo.  <\/p>\n<p><strong>How can I automatically get data from Dryad?:<\/strong><br \/>\nDryad has a wiki page that describes data access of the Dryad repository.   Sitemap can be used to traverse site.<br \/>\nOAI-PMH is a service-based protocol that can be used to search the metadata in the Dryad repository.  METS is a search page tools provided by Dryad to access the Dryad data.  A sequence of commands allows for searching the meta-data using the OAI-PMH interface to identify data set ids, then using the METS interface to obtain and download data bitstreams through data URLs.  These calls seem straightforward and in fact the examples provided made it easy to find data.<br \/>\nOne issue, Dryad URL for a resource can show metadata.  Unfortunately terms not always the same as what is found on the object.  For example: page shows that there is a field called: dc:relation.uri does not exist on the data, either does dc:relation.haspart.  These differences require me to enter more specific code to identify similar characteristics, if possible.<br \/>\nOne nice thing about this interface is that I can see the metadata online, by asking for a full metadata view.  This view helps me understand what I should expect from my data initially.<\/p>\n<p><strong>How can I automatically get data from KNB?:<\/strong><br \/>\nKNB data can be retrieved by using an API within for Metacat to search the EML-based repository.  Through the metacat api .  There are a lot of calls that can be used to access a metacat repository  that is accessible through the KNB metacat server http:\/\/knb\/ecoinformatiocs.org\/metacat\/<br \/>\nBasically, I would establish a public connection then start obtaining data.<br \/>\nKNB also has an OAI-PMH interface.   It might be useful to see how to generalize the step of obtaining metadata and then accessing and extracting information from data itself.<\/p>\n<p><strong>How can I automatically get data from ORNL-DAAC?:<\/strong><br \/>\nFound data and metadata directories.  Could not find how to get this information automatically.  Can perform configurations and add to a shopping cart but no access to a url.<br \/>\nWeb pages from searches provide some redirect call but this is not documented &#8211; can&#8217;t seem to figure it out.<br \/>\nThey have an OAI-PMH repository.  Where is it?<\/p>\n<h1>Dataset Questions<\/h1>\n<p><strong>What datasets were chosen from Dryad and why?:<\/strong><br \/>\nRyan Scherle was helpful in sending information on datasets.  He sent these suggestions:<br \/>\n* Associated article, using the article DOI and CrossRef&#8217;s LOD services.<br \/>\n* File type (i *think* there is LOD on file types somewhere)<br \/>\n* Taxon names &#8212; Not all records in Dryad will have them, but here&#8217;s one: doi:10.5061\/dryad.487<\/p>\n<p>In addition, he pointed me to datasets of interest at: https:\/\/www.datadryad.org\/wiki\/Sample_Dryad_Content<br \/>\nChose 6 datasets.<br \/>\ndoi:10.5061\/dryad.82 because it has a Treebase link.  I might be limited in what search information I can extract from the data; I may not even find a direct link but it has a good testcase from Treebase.<br \/>\nThis dataset has part: http:\/\/handle\/10255\/dryad.83.  Could search for artiodactyls, hunting, extinction.  Other dc:subject terms.  There is an excel file and the dataset is related to a Treebase record.  The Treebase record is NOT the data, just related to it.<br \/>\ndoi:10.5061\/dryad.8437 because it has a GenBank link.  This link is not as clearly visible as the previous data (e.g., dc:external_relation descriptor) so it will be interesting to see if I can show this one.<br \/>\nThis dataset has parts http:\/handle\/10255\/dryad.8438, dryad.8439, dryad.8440, dryad.8441.  I notice this should not be used yet because it is embargoed.  Does this research count since technically I am not using the results?  On Genbank HQ540559.  The Genbank record IS the data.<br \/>\ndoi:10.5061\/dryad.234 because it is the most popular.  This leads me to believe that it will relate to multiple &#8216;things&#8217; &#8211; so a good candidate.  Again, data not necessarily accessible to me.<br \/>\nThis dataset has part http:\/\/handle\/10255\/dryad.235.  The data file is an excel file.   From the excel file could link with Family, Binomial, Density, Region and reference#.<br \/>\ndoi:10.5061\/dryad.1252, doi:20.5061\/dryad.1295 and doi:10.5061\/dryad.2016 because I have more access to their data.  These all have fields that I could at the least relate in dates or taxonomies.<br \/>\neach has parts I can use as well as the local files that I can extract data from if links to the LOD are found.  e.g. latitude and longitude with ODE&#8217;s where view<br \/>\n1252 csv file has fields clade, museum number, state\/province, country, lat, long.<br \/>\n1295 has fields provenance and region.  Actually made up of more than one csv file both have Readme file describes fields.  Could have a structured file describe extraction fields.<br \/>\n2016 csv file has fields plot, nest &#8211; fields not as easy to see application.  Also has Readme.<br \/>\nNote: just going by csv accessible is hard when looking for &#8216;quality&#8217; of the cloud.<\/p>\n<p>The doi:10.5061\/dryad.8437 had an error with the GenBank link.  Apparently the author entered the wrong GenBank code.  The good thing was that the record could be fixed by a site administrator.<\/p>\n<p><strong>What datasets were chosen from KNB and why?:<\/strong><br \/>\nThere were no specific suggestions on the data to use from this community.  In addition, I found no metrics like the most popular.  The biggest difficulty in finding data on KNB was that many of the data files are links to other sites. The other site resulted in a login screen or accessing a new query tool or a new structure to have to traverse for the data.  So, the datasets I chose will at the least assure that the data is accessible.  This might change as I understand how to use the metacat api and find the data for a given metacat resource.<\/p>\n<p>KNB datasets chosen were:<br \/>\nconnolly.116.10 because it has 4 txt files and the metadata explains the file attributes.  Specifically Connolly.1051.1-AppendixD-1.txt.<br \/>\nSecond, connolly.2773.3  PHS8495.txt.  Also because it is text but also because it seems to have an internal similarity to connolly.272.3 NUT9195A.txt, the third choice.<\/p>\n<p>At the least, I can minimally show an internal relationship to the data.  I hope to achieve better cloud integration once the community sees an initial integration.<\/p>\n<p><strong>What datasets were chosen from ORNL-DAAC and why?:<\/strong><br \/>\nNo suggestions from the community.<br \/>\nSearched for csv files as mentioned by Giri, using ORNL-DAAC, otherwise the data file could not be downloaded.  The numbers given do not provide me a way to access data in automated form.<br \/>\nLook in data repository.  csv hard to find &#8211; found three.  Chosen in part because they are in csv and that was hard.  Still, can see integrations with GIS type info (long, lat) date, year.  In addition, some are describing the same information and come from the same data branch:<br \/>\nglobal_soil\/Global_Soil_Respiration\/data\/srdb_data_v1.csv<br \/>\nglobal_soil\/Global_Soil_Respiration\/data\/srdb_studies_v1.csv<br \/>\nfluxnet\/level_2_data\/tunbarumba\/data\/source_data\/tumbarumba_data_in_fluxnet_format_010222-081231.csv<\/p>\n<p><strong>Discuss the larger integration points (i.e., links to other common things) that make sense for Dryad data:<\/strong><br \/>\nLooking on the LOD, I am expecting to link data to GenBank, Treebase, some dataset that I can use the ScientificName that all resources on the Dryad repository seem to have and possible DOI links.  I would like to find a repository that has a link to some field on the data for all three datasets but I don\u2019t see it yet.  For example, much of the data might have a DOI, but that is not true in the metadata for anything other  than the Dryad data.<br \/>\nLooking at the ODE tool, they have a Where tab.  It would be nice to see where the data was stored.  A DOI OAI-PMH service was just announced on dryad-dev.  What if I create a small cloud and link data from that to this data.  This should the potential to link data out in a cloud.<br \/>\nDryad data shown as full metadata has dc:relation, dwc:ScientificName and dc:identifier that can be used to link data.  Unfortunately,  in oai_dc format, the same fields are not available.<br \/>\nIn searching through this data and links, I found that there are links to data in Catalog of Life, Encyclopedia of Life, and iPhylo.  I should be able to link to DBPedia data as well.<br \/>\n.  By entering http:\/\/www.eol.org\/search?q= &#8230; I can get related data.  There is an RDF view.  I could create links to this data and show the link back &#8211; or I could use the ODE sponger to extract what it can from the eol pages or rdf and then find relationships between the two.<br \/>\n<strong>Discuss the larger integration points (i.e., links to other common things) that make sense for KNB data:<\/strong><br \/>\nKNB integration with the LOD I hope to come from a taxonomy.  Given the difficulty in finding data, it was hard to spend time mapping data to the LOD.  EML is quite large and extensive, I am sure there are integration points, it will just take longer to make use of this information because there is a lot in there.<\/p>\n<p><strong>Discuss the larger integration points (i.e., links to other common things) that make sense for ORNL-DAAC:<\/strong><br \/>\nstill working on this one.<\/p>\n<p><strong>Discuss any vocabularies that make sense for the chosen Dryad data:<\/strong><br \/>\nI will use the existing metadata as the Dryad vocabulary, in particular those that come from the metadata section in the GetRecord verb of OAI-PMH.  That is, all metadata will be converted to RDF.  I will extract additional data as it links to the LOD cloud and I will give it ad-hoc names.<br \/>\nNot being able to see this data in the ODE leads me to believe that the vocabulary I am using, an ad-hoc RDF, is not sufficient for ODE.  I will look into OWL and RDFSchema to see if either make this better.<\/p>\n<p><strong>Discuss any vocabularies that make sense for the chosen KNB data:<\/strong><br \/>\nI will use the existing metadata (EML) as the KNB vocabulary.  That is, all metadata will be converted to RDF.  I am considering using OAI-PMH because it is a consistent way to get and map the metadata &#8211; I think.  I will extract additional data as it links to the LOD cloud and I will give it ad-hoc names.  I also hope to see and discuss the fact that this vocabulary is being used by many sites and there are already integrations that consider vocabulary (RDF, OWL).<\/p>\n<p><strong>Discuss any vocabularies that make sense for the chosen ORNL-DAAC data:<\/strong><br \/>\nMore ad-hoc than the other two vocabularies because I have no meta-data to work with.  For the csv files, I can work with column names and give them a ORNL-DAAC vocabulary initially then focus on how it integrates with the cloud.<\/p>\n<p><strong>Discuss URIs that should be given to the chosen Dryad data:<\/strong><br \/>\nInitial URIs will be based on Dryad.  The about statement will refer to the dryad handle for the metadata, because the RDF is based on that entry.  Notice that I will supplement the RDF with additional statement from the data so the new RDF will contain more information than the metadata found from the dryad handle.  Currently, the RDF is generated dynamically.  I personally prefer that because it allows for getting the current snapshot without having to store so much information, or evolution of information.  The downside, the Semantic Web is based on links.  If information is linked and that document changes, the linked data maybe invalid.  This is an argument against dynamically generated RDF.<\/p>\n<p><strong>Discuss URIs that should be given to the chosen KNB data:<\/strong><br \/>\nInitial URIs will be based on KNB.  The naming of KNB data is trickier.  Access to data is not necessarily available over the Web so the about field is more obscure.  Currently, one will be made up.  I foresee the issue to be when outside users, i.e., from the cloud, need access to KNB data, the naming needs to be openly available.<\/p>\n<p><strong>Discuss URIs that should be given to the chosen to ORNL-DAAC data:<\/strong><br \/>\nInitial URIs will be based on ORNL-DAAC.  Seems to have similar issues as KNB although I have less information here.<\/p>\n<h1>Other Notes<\/h1>\n<li>Dryad \u2013 best practices for data archiving:  focused on concrete suggestions for what, how and when of data archiving and sensible guidelines for data reuse.  \u2013 promote data archiving and responsible data reuse.  e.g, author contact, co-authorship, attribution\/credit.  Prevent data misuse.<\/li>\n<li>Data archiving in ecology and evolution: best practices:<br \/>\nMentions reason why data archiving good for science &amp; reuse by a large community.  Author focuses on how to archive a dataset to make it more usable.  Focused on archiving data associated with papers, data associated with unpublished projects do not have paper to give important context, methods and meta-data required to interpret results.<br \/>\nBest practices for data archiving:<br \/>\n1.\tChoose an archive most suitable for your data e.g., GenBank for DNA sequence, TreeBASE for phylogenetic trees.  Personal websites tend not to persist.<br \/>\n2.\tConsider future users, clearly communicate data and context.  Improve chance for reuse and avoids some level of future questions.<br \/>\n3.\t Annotate data with metadata (e.g., EML) or a README file.<br \/>\n4.\tStart data management while analyzing data and writing paper.<br \/>\n5.\tBe careful not to archive sensitive or prohibited information.<br \/>\n6.\tUse recommended non-proprietary file formats, e.g. comma-delimited or tex file rather than Word or Excel.<br \/>\n7.\tTest your data, rerun several analysis reported in paper.  Try to get an extra pair of experienced eyes look at the archive\/readme\/metadata.<br \/>\n8.\tArchive the data soon after collection but find plans for embargo until ready for public access.<br \/>\nBest practices for data use, respect for work:<br \/>\n1.\tNever reuse data without careful reading of original papers and associated materials.  Minimize misuse.<br \/>\n2.\tRecreate some results from original paper<br \/>\n3.\tContact original authors to discuss reuse of data<br \/>\n4.\tOffer co-authorship when original author\u2019s NEW input reaches non-trivial level.  Otherwise, cite original work in new work.<br \/>\n5.\tCite original paper and associated data, using full citations as with a research paper.  \u201cCite others\u2019 data as you would like your data to be cited\u201d<br \/>\n6.\tAlways check and recheck discrepancies or errors in original work then contact original authors for clarification.  If error substantial, may contact journal with corrigendum.  Attempt to work this with original authors.<br \/>\nBest practices for editors and publishers<br \/>\n1.\tExpect archiving data during research, not just after publication.<br \/>\n2.\tAllow for embargos for a specified time<br \/>\n3.\tInvite original authors to review publications that make extensive use of particular data.<br \/>\nAG: Questions:<br \/>\n1.\tNo mention about social data<br \/>\n2.\tNo mention about incremental data and its alignment to overall research publications, e.g., at least mention need for plan<br \/>\nRecent adoption of mandatory data archiving:  Joint Data Archiving Policy (JDAP)<br \/>\nPolicy of required deposition to be adopted by Dryad partner journals.  Mutual dependence, journals require &amp; dryad lowers barrier for storing and curating.  Data can be embargoed for up to a year.  Partner journals, e.g.,  The American Naturalist, Evolution, Evolutionary Applications, Heredity (The Genetics Society) \u2026<\/li>\n<li> KNB and ORNL-DAAC also have their best practices.  Did not compare them.<\/li>\n<li> Todd mentioned best practice of preventing misuse.  This seems to be a role for publishers as well.  Should data be searchable in any available way but only accessible if you can trace the source?  Does RDF help with this?<br \/>\n<h1>Initial Conclusions<\/h1>\n<p>Each repository has its own data and metadata.  Although this research is considering how to use the data in searching for data, the metadata already there should be leveraged to find the dataset as well.  They are being used now and several useful tools depend on them.  This information could then be supplemented with data specific information. Each dataset repository has an OAI-PMH interface.  As expected they each have the same verbs, as per the OAI-PMH spec, nevertheless, each has its own intricacies.  In particular, the language used to represent the metadata.  Furthermore, OAI-PMH, being a meta-data harvester, does not provide access to data.  Thus, understanding the mechanisms of the 3 repositories to access the data is necessary.  As noted, these are all different per site and not all seem to be easy to automate.  Dryad seems to have a documented process that is easy to replicate.  KNB has a single location to get data and its related metadata from, the Metacat server.  Currently, the only automated data or metadata access I have achieved from ORNL-DAAC is via a traversal of the ftp and metadata directory structures.<br \/>\nThere is no single RDF structure, e.g., data definition language, that could encompass the data provided at each of these sites.  As a result, we would run into similar issues as with OAI-PMH where we would need to support many metadata structures.  Understanding how to push the current dataset repository metadata and data information to RDF could help answer the following questions:<br \/>\nWhat is the minimal amount of work needed to expose data and metadata about a dataset to the cloud?<br \/>\nHow useful is this to help find specific data in an RDF browser?<br \/>\nHow useful is this when needing specific knowledge, e.g., semantic meaning as in X = Y of X isA Y?<br \/>\n&#8212; Are there other questions we should expect to answer?<\/p>\n<h1>References<\/h1>\n<p>http:\/\/www.datadryad.org\/about, About Dryad, 1\/28\/2011<br \/>\nhttp:\/\/datadryad.org\/factSheet, Dryad Fact Sheet, 1\/28\/2011<br \/>\nhttp:\/\/knb.ecoinformatics.org\/index.jsp, The Knowledge Network for Biocomplexity<br \/>\nhttp:\/\/knb.ecoinformatics.org\/software\/eml\/, Ecological Metadata Language (EML)<br \/>\nhttp:\/\/knb.ecoinformatics.org\/morphoportal.jsp, Morpho Data Management Software<br \/>\nhttp:\/\/daac.ornl.gov\/, ORNL DAAC<br \/>\nhttp:\/\/mercury.ornl.gov\/ornldaac\/, Mercury Metadata Search System<br \/>\nhttp:\/\/ils.unc.edu\/mrc\/, Metadata Research Center<br \/>\nhttps:\/\/www.nescent.org\/sites\/hive\/Main_Page, HIVE<br \/>\nhttps:\/\/www.nescent.org\/sites\/hive\/Research, Research<br \/>\nhttp:\/\/repositories.lib.utexas.edu\/recommended_file_formats, Recommended File Format<br \/>\nhttps:\/\/www.nescent.org\/wg_dryad\/DataAccess, Data Access<br \/>\nhttp:\/\/knb.ecoinformatics.org\/software\/download.jsp#metacat, Download<br \/>\nhttp:\/\/knb.ecoinformatics.org\/software\/metacat\/dev\/api\/, Javadoc<br \/>\nhttps:\/\/code.ecoinformatics.org\/code\/metacat\/trunk\/docs\/dev\/oaipmh\/MetacatOaipmh.pdf<br \/>\nhttp:\/\/daac.ornl.gov\/PI\/pi_info.shtml<br \/>\nhttp:\/\/daac.orn.gov\/data\/, ORNL-DAAC ftp data repository<br \/>\nhttp:\/\/blog.datadryad.org\/, Best practices for data archiving, 1\/26\/2011, Todd Vision<br \/>\nMichael C. Whitlock (2011) Data archiving in ecology and evolution: best practices, Trends in Ecology &amp; Evolution,  26 (2): 61-65.  doi:10.1016\/j.tree.2010.11.006.<br \/>\nWren, J.D. (2008) URL decay in MEDLINE \u2013 a 4-year follow-up study. Bioinformatics 24, 1381\u20131385<br \/>\nhttp:\/\/www.datadryad.org\/jdap, Joint Data Archiving Policy (JADP) 11\/16\/2010<br \/>\nhttp:\/\/www.openarchives.org\/OAI\/openarchivesprotocol.html, The Open Archives Initiative Protocol for Metadata Harvesting<\/li>\n","protected":false},"excerpt":{"rendered":"<p>In an effort to understand the 3 repositories for this research I began to collect some notes. In particular they are based on some questions that I asked as I started which were formulated as a result of the overall goals of this project. Feel free to provide some input. <a class=\"more-link\" href=\"https:\/\/notebooks.dataone.org\/general\/notes\/\">Continue reading <span class=\"screen-reader-text\">  Notes<\/span><span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":22,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[],"_links":{"self":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/859"}],"collection":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/users\/22"}],"replies":[{"embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/comments?post=859"}],"version-history":[{"count":1,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/859\/revisions"}],"predecessor-version":[{"id":881,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/859\/revisions\/881"}],"wp:attachment":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/media?parent=859"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/categories?post=859"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/tags?post=859"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}