Week 5: Parser in Apache Tika for DataONE File Formats

Hi All,

This post follows up on my earlier posts for Project 4: Extending Libmagic for Identification of Science Resources. This week, we shared our progress with the other developers in a short demo, showing how the file command and Apache Tika work for custom detection of the DataONE file formats. The tasks this week were more exploratory than those of the previous week.

Last week, we successfully created a metadata file and a parser file for the Onedcx file format by leveraging the functionality of DcXMLParser. Following the same path, we tried to create the metadata and parser files for the latest EML version, but it didn’t work as expected. The Onedcx XML file uses the “dc:” prefix in its XML tags, whereas the tags in the EML XML file carry no prefix. The EML parser code extracts the metadata correctly when the tags use the “eml:” prefix together with the proper namespace of the file. We tried several different namespaces to get the code working with the unprefixed tags, but none of them worked.
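To look at the prefix problem more closely, it can help to print the prefix, the local name, and the namespace URI that a namespace-aware parser reports for every element. The sketch below is only an illustration and not the project's code: the class name NamespaceProbe, the method describeElements, and both sample documents (including the placeholder namespace urn:example:eml) are made up for this example. It uses the streaming XML reader that ships with the JDK (javax.xml.stream).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Apache Tika or the project repository.
class NamespaceProbe {

    // Returns one "prefix|localName|namespaceURI" entry per start element.
    // A missing prefix or namespace is reported as an empty string.
    static List<String> describeElements(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
        // Sample input needs no DTD, so DTD processing is switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        List<String> out = new ArrayList<>();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        String prefix = reader.getPrefix();
                        String uri = reader.getNamespaceURI();
                        out.add((prefix == null ? "" : prefix)
                                + "|" + reader.getLocalName()
                                + "|" + (uri == null ? "" : uri));
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample in the style of a Dublin Core record: prefixed child tag.
        String dcLike = "<record xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
                + "<dc:title>Sample</dc:title></record>";
        // Made-up sample in the style of EML: prefixed root, unprefixed children.
        String emlLike = "<eml:eml xmlns:eml=\"urn:example:eml\">"
                + "<dataset><title>Sample</title></dataset></eml:eml>";
        describeElements(dcLike).forEach(System.out::println);
        describeElements(emlLike).forEach(System.out::println);
    }
}
```

In the made-up EML-style sample, only the root element is reported with a namespace, while the unprefixed children come back with an empty one. A parser that looks for those children under a namespace would find nothing, which might explain the behaviour we saw, but that is a guess and would need to be checked against real EML files.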

Our goal is to detect the content type of a file in a DataONE file format and to extract the metadata values from it. With that in mind, we took a step back and tried to create a parser that extracts the nodes and values from the XML file. Along the way, we learned about a few approaches for extracting data from XML documents:

  • Document Object Model (DOM), which loads the whole document into an in-memory tree
  • Simple API for XML (SAX), a standard for event-based XML parsing
  • Streaming API for XML (StAX), which reads the document forward as a stream of events

We used the StAX approach to read the XML file in the forward direction and parse the data from it. This let us extract the XML metadata elements and the values associated with them. In the coming weeks, we will develop a mapper that maps qualified metadata fields across the different file formats. The parser will use these fields to extract the data from a file and present it in a tabular format.
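For readers who have not used StAX before, here is a minimal sketch of forward-only reading that collects each element path together with its text value as tab-separated rows. It is not the parser from the project: the class name StaxNodeExtractor, the method extract, the path-plus-tab row layout, and the sample XML are all assumptions made for this illustration, and only the JDK's javax.xml.stream API is used.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not the project's actual parser.
class StaxNodeExtractor {

    // Returns "path<TAB>value" rows for every element that directly holds text.
    static List<String> extract(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Merge adjacent text chunks so each value arrives as one event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        // Metadata files may come from anywhere, so DTDs and external
        // entities are switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);

        Deque<String> path = new ArrayDeque<>();   // open elements, root first
        StringBuilder text = new StringBuilder();  // text seen since the last tag
        List<String> rows = new ArrayList<>();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    switch (reader.next()) {
                        case XMLStreamConstants.START_ELEMENT:
                            path.addLast(reader.getLocalName());
                            text.setLength(0);
                            break;
                        case XMLStreamConstants.CHARACTERS:
                            text.append(reader.getText());
                            break;
                        case XMLStreamConstants.END_ELEMENT:
                            String value = text.toString().trim();
                            if (!value.isEmpty()) {
                                rows.add(String.join("/", path) + "\t" + value);
                            }
                            path.removeLast();
                            text.setLength(0);
                            break;
                        default:
                            break; // comments, processing instructions, etc.
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample document, loosely shaped like a metadata record.
        String xml = "<eml><dataset><title>Soil data</title>"
                + "<creator><name>Ada</name></creator></dataset></eml>";
        extract(xml).forEach(System.out::println);
    }
}
```

The sketch keeps only the text that sits directly inside an element just before its closing tag, so mixed content (text interleaved with child elements) is simplified. Matching on local names also sidesteps the prefix question, at the cost of ignoring namespaces entirely.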

That’s all for now, see you all next week!

Have a great weekend!

Resource links: Github-file_identification, Github-DataONE Parser, Project Plan
