{"id":3227,"date":"2018-06-29T21:59:27","date_gmt":"2018-06-29T21:59:27","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=3227"},"modified":"2019-05-24T17:47:26","modified_gmt":"2019-05-24T17:47:26","slug":"week-5-parser-in-apache-tika-for-dataone-file-format","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/extending-libmagic\/week-5-parser-in-apache-tika-for-dataone-file-format\/","title":{"rendered":"Week 5: Parser in Apache Tika for DataONE file Format."},"content":{"rendered":"

Hi All,<\/p>\n

This post follows up on my earlier posts for Project 4: Extending Libmagic for Identification of Science Resources<\/a>. This week, we shared our progress with other developers in a short demo, showing how the file command and Apache Tika can be used for custom detection of the DataONE file formats. This week's tasks were more exploratory than those of the previous week.<\/p>\n

Last week, we successfully created a metadata file and a parser file for the Onedcx file format by leveraging the functionality of DcXMLParser. Following the same path, we tried to create the metadata and parser files for the latest EML version, but it didn’t work as expected. The onedcx XML file uses the “dc:” prefix in its XML tags, whereas the EML XML file has no prefix on its tags. The EML parser code extracts the metadata correctly when the “eml:” prefix is used in the tags together with the proper namespace of the file. We tried several different namespaces to get the code working on the unprefixed tags, but none of them worked.<\/p>\n

Our goal is to detect the content type of a file in a DataONE file format and to extract the metadata values from it. With that in mind, we took a step back and tried to create a parser that extracts the nodes and their values from the XML file. Along the way, we learned about a few approaches for extracting data from XML documents.<\/p>\n