Week 6: Parser, Metadata Mapper Using Apache Tika

Hi All,

This post follows up on my earlier posts for Project 4: Extending Libmagic for Identification of Science Resources. Having reset our goals for the rest of the project last week, we now aim to extract metadata from different file formats using Apache Tika.

Since we want to extract the metadata fields from a given input file, a custom parser needs to be created for each file format. Every file format has a different XML structure and different metadata fields, so a single parser cannot extract all the metadata fields present. Last week we created a parser using the StAX API in Java to parse the contents of the XML file, and it worked fine. However, StAX only reads the XML forward as a stream, so we can’t query or extract specific XML tags or fields from it. The DOM API in Java gives us the flexibility to get the fields using XPath queries or the getElementsByTagName method.

We created a parser and a class, XPathMetdata, which consists of XPath queries for all the metadata fields present in an EML file. The XPath for each field is passed to the parser method “getNodeValues”, which returns the list of matching nodes from the file. Two of the queries are shown below:

emlXpath.add("//dataset/creator/*");
emlXpath.add("//dataset/creator/address/*");
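
To make the XPath step concrete, here is a minimal sketch of how a method like getNodeValues could be written with the JDK's DOM and XPath APIs. The class name XPathSketch, the parse helper, and the sample XML are assumptions made for illustration; the project's actual parser may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical class name; a sketch, not the project's parser.
class XPathSketch {

    // Hypothetical helper: loads an XML string into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Evaluates one XPath expression and returns the matching nodes in document order.
    static List<Node> getNodeValues(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList matches = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<Node> nodes = new ArrayList<>();
            for (int i = 0; i < matches.getLength(); i++) {
                nodes.add(matches.item(i));
            }
            return nodes;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample that only loosely resembles an EML file.
        String xml = "<eml><dataset><creator>"
                + "<organizationName>Example Lab</organizationName>"
                + "<address><city>Springfield</city></address>"
                + "</creator></dataset></eml>";
        Document doc = parse(xml);
        for (Node node : getNodeValues(doc, "//dataset/creator/*")) {
            System.out.println(node.getNodeName() + " = " + node.getTextContent().trim());
        }
    }
}
```

Because the query names the full path from dataset to creator, it only matches files that keep exactly that nesting.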

This works fine, but if the XML file structure changes, or a field is added or removed, the queries would fail, so this approach won’t scale well. We want to get specific metadata fields from the input file, so a better approach is to get each metadata field and its child nodes using specific tags. The new method getMetadata takes the file and tags as input parameters and prints the node name and node value for each metadata field. It also handles the child nodes of the passed XML tag.
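
A tag-based lookup along these lines could be sketched as follows. This is an illustration under assumptions, not the project's code: the class name TagMetadataSketch and the helpers parse and collect are hypothetical, the method takes an already parsed document and a single tag rather than a file and a list of tags, and it returns the lines so that the caller can print them.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical class name; a sketch of tag-based extraction with child-node handling.
class TagMetadataSketch {

    // Hypothetical helper: loads an XML string into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Finds every element with the given tag, wherever it sits in the tree,
    // and returns one "name = value" line per leaf element beneath it.
    static List<String> getMetadata(Document doc, String tag) {
        List<String> lines = new ArrayList<>();
        NodeList matches = doc.getElementsByTagName(tag);
        for (int i = 0; i < matches.getLength(); i++) {
            collect(matches.item(i), lines);
        }
        return lines;
    }

    // Recurses into child elements; an element without child elements is a leaf.
    private static void collect(Node node, List<String> lines) {
        boolean hasElementChild = false;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasElementChild = true;
                collect(child, lines);
            }
        }
        if (!hasElementChild) {
            lines.add(node.getNodeName() + " = " + node.getTextContent().trim());
        }
    }

    public static void main(String[] args) {
        // Made-up sample: title is a leaf, abstract wraps its text in a child element.
        String xml = "<eml><dataset><title>Soil survey</title>"
                + "<abstract><para>Soil data.</para></abstract></dataset></eml>";
        Document doc = parse(xml);
        getMetadata(doc, "title").forEach(System.out::println);
        getMetadata(doc, "abstract").forEach(System.out::println);
    }
}
```

Since getElementsByTagName searches the whole tree, the lookup keeps working when a tag moves to a different depth, which is the property the fixed XPath queries lacked.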

The next step toward our goal is to create the metadata fields and the mapper. Although we created them earlier, redesigning them is worthwhile. This is needed because the metadata fields differ across file formats. For example, the <abstract> field in the EML format has a child node, whereas in onedcx it doesn’t. The same holds for other fields such as spatial and temporal. For now, fields like title and address are mapped in the Mapper class.

metadataFields.add("title");
metadataFields.add("address");

The metadata fields present in the list are passed as input to the parser class, which returns each metadata field and its values. The output is correct, but certain cases still need to be handled, such as empty tags being returned. In the coming week, our goal is to redesign the mapper and create the metadata fields for it.
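
As a rough sketch of that flow, the snippet below passes a list of field names to a parser step and collects the values per field. Skipping empty tags is just one possible way to handle them, not necessarily the one the project will adopt. The class name MapperSketch, the methods parse and extract, and the sample XML are all assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical class name; a sketch of mapping field names to extracted values.
class MapperSketch {

    // Hypothetical helper: loads an XML string into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Returns field -> values in the order the fields were requested.
    // Empty tags are skipped, and fields with no values are left out of the map.
    static Map<String, List<String>> extract(Document doc, List<String> metadataFields) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String field : metadataFields) {
            NodeList matches = doc.getElementsByTagName(field);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < matches.getLength(); i++) {
                // getTextContent also gathers text from child elements.
                String text = matches.item(i).getTextContent().trim();
                if (!text.isEmpty()) {
                    values.add(text);
                }
            }
            if (!values.isEmpty()) {
                result.put(field, values);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample with one filled and one empty address tag.
        String xml = "<eml><dataset><title>Soil survey</title><creator>"
                + "<address><city>Springfield</city></address><address/>"
                + "</creator></dataset></eml>";
        List<String> metadataFields = Arrays.asList("title", "address");
        System.out.println(extract(parse(xml), metadataFields));
        // prints {title=[Soil survey], address=[Springfield]}
    }
}
```

Because getTextContent concatenates the text of all descendants, the same call covers a field that holds its text directly and one that wraps it in a child node.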

That’s all for now, see you all next week!

Have a great weekend!

Resource links: Github-file_identification, Github-DataONE Parser, Project Plan
