Hi All,<\/p>\n
This post follows up on my earlier posts for Project 4: Extending Libmagic for Identification of Science Resources<\/a>. After resetting our goals for the rest of the project last week, the goal now is to extract metadata from different file formats using Apache Tika.<\/p>\n
This works fine, but if the XML file structure changes, or a field is added or removed, it would fail, so this approach won’t scale well. We want specific metadata fields from the input file, so a better approach is to fetch the metadata and its child nodes by their specific tags. The new method getMetadata takes the file and the tags as input parameters and prints the node name and node value for each metadata field. It also handles the child nodes of the passed XML tag.<\/p>\n
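A minimal sketch of the tag-based lookup described above, not the project's actual getMetadata. It assumes the JDK's built-in DOM parser rather than Tika, reads from an InputStream instead of a file, and returns the "name: value" lines so they are easy to check. The class name `TagMetadataSketch`, the helper `collect`, and the sample XML are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical sketch: look up metadata by tag name instead of by fixed position.
final class TagMetadataSketch {

    // Returns one "name: value" line per leaf element found under the requested tags.
    static List<String> getMetadata(InputStream xml, List<String> tags) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML input", e);
        }
        List<String> lines = new ArrayList<>();
        for (String tag : tags) {
            NodeList matches = doc.getElementsByTagName(tag);
            for (int i = 0; i < matches.getLength(); i++) {
                collect((Element) matches.item(i), lines);
            }
        }
        return lines;
    }

    // Recurses into child elements; only elements without child elements emit a line.
    private static void collect(Element element, List<String> lines) {
        boolean hasChildElements = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElements = true;
                collect((Element) child, lines);
            }
        }
        if (!hasChildElements) {
            lines.add(element.getNodeName() + ": " + element.getTextContent().trim());
        }
    }

    public static void main(String[] args) {
        // Made-up sample document, loosely shaped like a metadata record.
        String xml = "<dataset><title>Lake survey</title>"
                + "<abstract><para>Water samples.</para></abstract></dataset>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        getMetadata(in, Arrays.asList("title", "abstract")).forEach(System.out::println);
    }
}
```

Because the lookup is by tag name, adding or removing unrelated fields in the XML does not affect the result. Text mixed in alongside child elements is ignored in this sketch.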
The next step toward our goal is to create the metadata fields and the mapper. Although we created them earlier, redesigning them is worthwhile. This is needed because the metadata fields differ across file formats: for example, the <abstract> field in the EML format has a child node, whereas in onedcx it doesn’t. The same holds for other fields such as spatial and temporal. For now, fields like title and address are mapped in the Mapper class.<\/p>\n
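One way such a mapper could look, sketched under assumptions: a common field name is translated into a format-specific tag path, so the parser does not need to know that a value sits in a child node for one format and directly in the tag for another. The class name `MetadataMapperSketch`, the `Format` enum, and the tag paths are illustrative and not taken from the project's Mapper class.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of a per-format field mapper.
final class MetadataMapperSketch {

    enum Format { EML, ONEDCX }

    // Common field name -> format-specific tag path (all paths are illustrative).
    private static final Map<Format, Map<String, String>> TAGS = new EnumMap<>(Format.class);

    static {
        Map<String, String> eml = new HashMap<>();
        eml.put("title", "title");
        eml.put("abstract", "abstract/para"); // value assumed to sit in a child node
        TAGS.put(Format.EML, eml);

        Map<String, String> onedcx = new HashMap<>();
        onedcx.put("title", "title");
        onedcx.put("abstract", "abstract");   // value assumed to sit directly in the tag
        TAGS.put(Format.ONEDCX, onedcx);
    }

    // Returns the tag path for a field, or empty if the format does not map it.
    static Optional<String> tagFor(Format format, String field) {
        return Optional.ofNullable(TAGS.get(format)).map(fields -> fields.get(field));
    }

    public static void main(String[] args) {
        System.out.println(tagFor(Format.EML, "abstract").orElse("(unmapped)"));
        System.out.println(tagFor(Format.ONEDCX, "abstract").orElse("(unmapped)"));
        System.out.println(tagFor(Format.ONEDCX, "spatial").orElse("(unmapped)"));
    }
}
```

Returning an `Optional` makes an unmapped field explicit rather than a silent null, which seems useful while fields such as spatial and temporal are still being added.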
The metadata fields present in the List are passed as input to the parser class, which returns the metadata fields and their values. The output is correct, but some cases still need handling, such as empty tags being returned. In the coming week, our goal is to redesign the mapper and create the metadata fields for it.<\/p>\n
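The empty-tag case could be handled with a small post-processing step. This is a sketch only, assuming the parser output is available as a map from field name to value. The class name `EmptyTagFilterSketch` and the method `dropEmpty` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: drop fields whose tag was present but carried no value.
final class EmptyTagFilterSketch {

    // Returns a copy of the input without null or whitespace-only values,
    // keeping the original field order.
    static Map<String, String> dropEmpty(Map<String, String> fields) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            String value = entry.getValue();
            if (value != null && !value.trim().isEmpty()) {
                kept.put(entry.getKey(), value.trim());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> parsed = new LinkedHashMap<>();
        parsed.put("title", "Lake survey");
        parsed.put("abstract", "   "); // empty tag in the source file
        parsed.put("address", "");
        System.out.println(dropEmpty(parsed));
    }
}
```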
That’s all for now, see you all next week!<\/p>\n
Have a great weekend!<\/p>\n
<\/p>\n
Resource links: Github-file_identification<\/a>, Github-DataONE Parser,<\/a> Project Plan<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"