{"id":3242,"date":"2018-07-06T23:00:00","date_gmt":"2018-07-06T23:00:00","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=3242"},"modified":"2019-05-24T17:47:21","modified_gmt":"2019-05-24T17:47:21","slug":"week-6-parser-metadata-mapper-using-apache-tika","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/extending-libmagic\/week-6-parser-metadata-mapper-using-apache-tika\/","title":{"rendered":"Week 6: Parser, Metadata Mapper Using Apache Tika"},"content":{"rendered":"

Hi All,<\/p>\n

This post is a follow-up to my earlier posts for Project 4: Extending Libmagic for Identification of Science Resources<\/a>. After resetting our goals for the rest of the project last week, the goal now is to extract metadata from different file formats using Apache Tika.<\/p>\n

Since we want to extract the metadata fields from a given input file, a custom parser needs to be created for each file format. Every file format has a different XML structure and different metadata fields, so a single parser cannot extract all of the metadata fields present. Using the StAX API in Java, we created a parser last week for parsing the contents of the XML file, which worked fine. However, StAX only provides forward reading of the XML, so we can’t query or extract specific XML tags or fields from it. The DOM API in Java gives us the flexibility to get the fields using XPath queries or the method getElementsByTagName.<\/p>\n
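To make the DOM approach concrete, here is a minimal sketch of looking up elements by tag name with getElementsByTagName. The class name DomTagLookup, the helper valuesByTag, and the tiny sample XML are all made up for illustration and are not the project's actual code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the project: loads XML into a DOM tree
// and reads the text of every element with a given tag name.
public class DomTagLookup {

    // Made-up sample input; a real EML file is much larger and namespaced.
    static final String SAMPLE_XML =
        "<dataset><title>Sample dataset</title>"
        + "<creator><surName>Doe</surName></creator>"
        + "<creator><surName>Roe</surName></creator></dataset>";

    static List<String> valuesByTag(String xml, String tagName) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The whole document is held in memory, so any element can be
            // looked up at any time, unlike a forward-only stream reader.
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(valuesByTag(SAMPLE_XML, "surName"));
    }
}
```

The trade-off is memory: DOM loads the full tree, which is fine for metadata-sized XML files but would be costly for very large documents.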

We created a parser and a class XPathMetdata, which consists of XPath queries for all the metadata fields present in an EML file. The XPath for a field is then passed to the parser method “getNodeValues”, which returns the list of matching nodes from the file.<\/p>\n
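A rough sketch of how a method like getNodeValues could work with the standard javax.xml.xpath API is shown here. Only the method name getNodeValues comes from the project. The class name XPathNodeLookup, the two XPath constants, the method signature, and the simplified sample XML without namespaces are assumptions for illustration, so the real implementation may look different.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical sketch: evaluates an XPath query against a DOM tree and
// returns the matching nodes as a List.
public class XPathNodeLookup {

    // Assumed example queries; the real XPathMetdata class holds one per field.
    static final String TITLE_XPATH = "/eml/dataset/title";
    static final String SURNAME_XPATH = "/eml/dataset/creator/individualName/surName";

    // Simplified, namespace-free stand-in for an EML document.
    static final String SAMPLE_XML =
        "<eml><dataset><title>Sample dataset</title>"
        + "<creator><individualName><surName>Doe</surName></individualName></creator>"
        + "<creator><individualName><surName>Roe</surName></individualName></creator>"
        + "</dataset></eml>";

    static List<Node> getNodeValues(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // NODESET asks for every match, not just the first one.
            NodeList matches =
                (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<Node> result = new ArrayList<>();
            for (int i = 0; i < matches.getLength(); i++) {
                result.add(matches.item(i));
            }
            return result;
        } catch (ParserConfigurationException | SAXException
                 | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException("XPath lookup failed", e);
        }
    }

    public static void main(String[] args) {
        for (Node node : getNodeValues(SAMPLE_XML, SURNAME_XPATH)) {
            System.out.println(node.getNodeName() + " = " + node.getTextContent());
        }
    }
}
```

Keeping the queries in one place means that supporting a new metadata field only needs a new XPath string, while the lookup method stays the same.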