Week 4: Creating a Parser in Apache Tika for the onedcx File Format

Hi All,

This post continues my earlier posts on Project 4: Extending Libmagic for Identification of Science Resources. Continuing from last week, we explored the Apache server’s functionality for detecting custom MIME types for the DataONE file format. The server’s httpd.conf configuration file accepts a “MIMEMagicFile” directive, which can be set to point to a custom magic file. Because the Apache server’s use of the magic file is more limited than that of the file command, we can’t use it to define the metadata of the DataONE file format. We updated the documentation for file identification using the file command, Apache Tika, and the Apache server.

The next objective is to explore the further functionality that Apache Tika offers for detecting file formats and parsing their metadata. Apache Tika provides extensive content detection and extraction functionality through simple configuration of the Tika application.

In Week 3, we completed the task of adding and detecting custom MIME types for the DataONE file format. This week, we set our goal to create a parser in Tika for extracting the metadata of the onedcx file format. We created a Maven project as the development environment and added the required Tika and custom-mimetypes.jar files.

While creating the parser for the onedcx file format, we ran into problems extracting the file’s metadata with Tika’s default functionality. A onedcx file is an XML file, so we parsed its contents with Tika’s XMLParser class. The text content was extracted successfully, but the metadata we needed was not. The cause is Tika’s default handling of XML files, which extracts the text content but not the metadata, except for files in the Dublin Core format. To resolve this, we created a onedcx file with XML tags for the metadata and also set the XML namespace of the file.

Once the metadata file for the onedcx XML file was created, metadata extraction succeeded. We leveraged the functionality of the Dublin Core format to create the onedcx parser and its metadata.
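
To make the idea concrete, here is a minimal sketch of namespace-based metadata extraction: it collects the text of every element in the Dublin Core namespace and ignores everything else. It deliberately uses only the JDK’s SAX parser rather than Tika’s own classes, so it is not the parser from this project. The class name DublinCoreSketch, the "dc:" key prefix, and the sample XML layout are assumptions made for illustration; a real onedcx file may be structured differently.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Tika or of the project code.
public class DublinCoreSketch {
    // Namespace URI of the Dublin Core element set.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns the text of each Dublin Core element, keyed as "dc:<localName>". */
    public static Map<String, String> extract(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Without namespace awareness the uri and localName arguments are empty.
        factory.setNamespaceAware(true);
        Map<String, String> metadata = new LinkedHashMap<>();

        DefaultHandler handler = new DefaultHandler() {
            // Non-null only while we are inside a Dublin Core element.
            private StringBuilder text;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                text = DC_NS.equals(uri) ? new StringBuilder() : null;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (text != null && DC_NS.equals(uri)) {
                    // A repeated element overwrites the earlier value in this sketch.
                    metadata.put("dc:" + localName, text.toString().trim());
                }
                text = null;
            }
        };

        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document; the root element and values are placeholders.
        String xml = "<metadata xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
                + "<dc:title>Sample dataset</dc:title>"
                + "<dc:creator>Jane Doe</dc:creator>"
                + "<note>not in the DC namespace, so it is skipped</note>"
                + "</metadata>";
        extract(xml).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The point of the sketch is that the namespace declaration is what makes the tags recognizable as metadata: elements without the Dublin Core namespace are treated as ordinary content, which matches the behavior we ran into before setting the namespace in the onedcx file.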

In the coming weeks, we will work on the parsers for the other DataONE file formats.

That’s all for now, see you all next week!

Have a great weekend!

Resource links: Github-file_identification, Github-DataONE Parser, Project Plan
