Week 8: DataONE MetaData parser Application

Hi All,

This post follows up on my earlier posts for Project 4: Extending Libmagic for Identification of Science Resources. This week was very productive, and we resolved most of the design and development issues for the final application. The application is in its final stages, with only a few finishing touches left before release.


  1. Unified Labels: One of our goals was to show metadata fields that have different node names across file formats under the same label. For example, the spatial field in onedcx is represented by “<dcterms:spatial>”, whereas in eml-211 the “<coverage>” node and its children represent it. Both carry the same spatial information about the coverage of the dataset, so we added a “label” attribute to the field tag in the config file, for example: <field label="Subject">. The application reads the attribute value and associates it with the XPath or element tag for that field. When printing the metadata fields and values parsed from the input file, these labels are shown instead of the XPath or element tags.
  2. getNodeAttr Method: We created the getNodeAttr method to extract the namespace, prefix, and label attributes. It extracts the prefix and uri attribute values and stores them in a hash map. For the label attribute, it returns the value as a String. This method resolved the issue with namespace handling and made it possible to give common names to metadata fields that have the same context or function across different formats.
  3. Escape Characters Issue: Tika identified the file format from the custom-mimetypes.xml file, but it prints the formatId with escape characters: “text/XML; formatid="http\:\/\/ns.dataone.org\/metadata\/schema\/onedcx\/v1.0"”. This string is generated by Tika, so we couldn’t fix it at the source. However, our application controls how the file type is printed, so we used the replace method of the String class to remove these characters.
  4. Null Values: When printing the values for the metadata fields, the application used to print null if the field had child nodes, which cluttered the output. To fix this, we now concatenate the values with “;” only if they are not null, which makes the output cleaner.
  5. Configuration of Other File Formats: With the onedcx and eml-211 file formats working, we moved on and created entries for the other file formats in our config file as well.
  6. Documentation: We used the javadoc utility to document the classes and their methods.
  7. Application Jar File: A Jar file for the application needs to be created so that it can be run from the command line by passing the input file name. The generated jar currently fails with the error “no main manifest attribute”. The Manifest file is missing from the project, so the application entry point can’t be found. I am working on resolving this issue.
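
To make items 1, 3, and 4 more concrete, here is a minimal Java sketch of those three steps. It is not the application's actual code: the class name, the method names (readLabels, cleanFormatId, joinNonNull), and the layout of the config snippet (a field tag whose text is the XPath or element name) are all assumptions made for illustration. Only the “label” attribute, the String replace call, and the “;” separator come from the description above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class; names and config layout are illustrative assumptions.
class MetadataOutputSketch {

    // Item 1: map each field's element name (the tag text) to its "label" attribute.
    static Map<String, String> readLabels(String configXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(configXml.getBytes(StandardCharsets.UTF_8)));
            NodeList fields = doc.getElementsByTagName("field");
            Map<String, String> labels = new LinkedHashMap<>();
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                labels.put(field.getTextContent().trim(), field.getAttribute("label"));
            }
            return labels;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse config snippet", e);
        }
    }

    // Item 3: remove the backslashes that appear in the formatId string.
    static String cleanFormatId(String raw) {
        return raw.replace("\\", "");
    }

    // Item 4: join values with ";" and skip any null entries.
    static String joinNonNull(List<String> values) {
        StringBuilder sb = new StringBuilder();
        for (String value : values) {
            if (value == null) {
                continue; // fields with only child nodes yield null here
            }
            if (sb.length() > 0) {
                sb.append(";");
            }
            sb.append(value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed config layout: two formats share one label.
        String config = "<config>"
                + "<field label=\"Spatial\">dcterms:spatial</field>"
                + "<field label=\"Spatial\">coverage</field>"
                + "</config>";
        System.out.println(readLabels(config));
        System.out.println(cleanFormatId(
                "text/XML; formatid=\"http\\:\\/\\/ns.dataone.org\\/metadata\\/schema\\/onedcx\\/v1.0\""));
        System.out.println(joinNonNull(Arrays.asList("North America", null, "Alaska")));
    }
}
```

Mapping the element name to the label keeps the printing code independent of the file format: it only ever looks up a label, whichever format the input file happens to use.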

In the coming week, the following tasks need to be completed:

  1. Creation of the readme file for the application.
  2. Creation of application Jar file.
  3. Two unit test cases.
  4. Poster creation for the summer internship.

That’s all for now, see you all next week!

Have a great weekend!

Resource links: Github-file_identification, Github-DataONE Parser, Project Plan
