Week 7: DataONE Metadata Parser

Hi All,

This post follows up on my earlier posts about Project 4: Extending Libmagic for Identification of Science Resources. This week, we discussed adding more functionality to our parser and making it easy to configure when new file formats are added.

Using a configuration file: During our daily meetings, we found that it would be useful to have a configuration file for adding new metadata fields and file formats. We created configFile.xml, an XML file that lists the metadata fields for each file format. Each metadata field is given either as a tag name or as an XPath expression, and the file also contains the namespace tags when a file format uses prefixed tags. The application reads the config file to detect the file format and then returns the XPath for extracting each metadata field of that file type. If the file is not a DataONE file format, it should be handled by a different parser.
Using XPath for extracting contents: DataOneXMLParser.java had one method for extracting values, which uses the getelementtags method, and we added a second one that uses an XPath expression. This helps extract a specific value by its path when there are multiple values for the same field. We faced some issues while extracting the
Default Tika parser for default file format: In the DataOneMapper.java class we created a method that returns the list of metadata fields using the default Tika parser if the file is not a DataONE file. This handles the boundary case for our application.
SimpleContext class: When we added XPath expressions as fields in the configuration file, they didn’t work as expected on the first run. The XPath expressions for the onedcx file were not resolving successfully. The onedcx file has tags that use the prefixes “dc” and “dcterms”, and for the path to resolve correctly, a namespace must be set for each of these prefixes. To handle this we created a SimpleContext class, which stores the prefix and namespace URI pairs in a hash map. The XPath object is given this context through its setNamespaceContext method. This worked fine after a small adjustment of the method calls in the class.
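
The namespace handling described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the body of SimpleContext, the bind method, the NamespaceXPathDemo class, and the sample XML are all assumptions made for the example. Only the “dc” and “dcterms” prefixes, the hash map idea, and the setNamespaceContext call come from the description above. The namespace URIs are the standard Dublin Core ones.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical stand-in for SimpleContext: maps prefixes to namespace URIs. */
class SimpleContext implements NamespaceContext {
    private final Map<String, String> prefixToUri = new HashMap<>();

    /** Hypothetical helper: register one prefix/URI pair. */
    void bind(String prefix, String uri) {
        prefixToUri.put(prefix, uri);
    }

    @Override
    public String getNamespaceURI(String prefix) {
        if (prefix == null) {
            throw new IllegalArgumentException("prefix must not be null");
        }
        // Unbound prefixes map to the empty "no namespace" URI.
        return prefixToUri.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
    }

    @Override
    public String getPrefix(String namespaceURI) {
        Iterator<String> it = getPrefixes(namespaceURI);
        return it.hasNext() ? it.next() : null;
    }

    @Override
    public Iterator<String> getPrefixes(String namespaceURI) {
        List<String> prefixes = new ArrayList<>();
        for (Map.Entry<String, String> entry : prefixToUri.entrySet()) {
            if (entry.getValue().equals(namespaceURI)) {
                prefixes.add(entry.getKey());
            }
        }
        return prefixes.iterator();
    }
}

class NamespaceXPathDemo {
    /** Evaluates an XPath expression against an XML string and returns the first match as text. */
    static String extract(String xml, String expression, NamespaceContext context) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Without this, the parser drops namespace information and prefixed paths match nothing.
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(context); // must be set before evaluating prefixed expressions
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Simplified sample document, not a real onedcx file.
        String xml = "<metadata xmlns:dc=\"http://purl.org/dc/elements/1.1/\""
                + " xmlns:dcterms=\"http://purl.org/dc/terms/\">"
                + "<dc:title>Sample dataset</dc:title>"
                + "<dcterms:abstract>Sample abstract</dcterms:abstract>"
                + "</metadata>";

        SimpleContext context = new SimpleContext();
        context.bind("dc", "http://purl.org/dc/elements/1.1/");
        context.bind("dcterms", "http://purl.org/dc/terms/");

        System.out.println(extract(xml, "/metadata/dc:title", context));
        System.out.println(extract(xml, "/metadata/dcterms:abstract", context));
    }
}
```

Two details matter in a setup like this: the DocumentBuilderFactory has to be namespace-aware, and the namespace context has to be attached to the XPath object before the expression is evaluated. The prefixes used in the XPath expression only need to match the ones bound in the context, since matching is done on the namespace URI.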

Next week, we will work on how to use the namespace fields and set them when evaluating the XPath expressions. We will also clean up the code and test the application for stability.

That’s all for now, see you all next week!

Have a great weekend!

Resource links: Github-file_identification, Github-DataONE Parser, Project Plan
