This post continues my earlier posts on Project 4: Extending Libmagic for Identification of Science Resources. Last week we created the magic file for the file command, and the repository maintainers accepted and committed our changes to the library. This week we explored Apache Tika in more depth, with the aim of giving it similar functionality for identifying the DataONE file formats. We completed the tasks below and will set new goals for the internship in the coming weeks.
- Installing and using Apache Tika to detect file types
We downloaded the source code and jar files for the Tika application and built them with Maven so that we could create the custom-mimetypes.xml file.
- Using magic tags in tika-mimetypes.xml
The Tika application uses an XML file, “tika-mimetypes.xml”, which describes the MIME types that Tika supports. The file supports several tags that carry the information used to identify file types, such as <glob> and <magic>. A <glob> tag matches on the file name, while a <magic> tag matches on the file's content, much like the magic numbers used in libmagic.
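To make the difference between the two tags concrete, here is a minimal Python sketch of name-based (glob) versus content-based (magic) detection. It is a simplification for illustration, not Tika's actual resolution logic. The MIME type `application/x-example-science+xml`, the byte signature, and the 1024-byte search window are assumptions chosen for the example.

```python
import fnmatch

# Hypothetical rules for illustration only. The science MIME type name,
# the byte signature, and the offset range are assumptions, not entries
# copied from tika-mimetypes.xml.
RULES = [
    {
        "mime": "application/x-example-science+xml",
        "glob": None,
        "magic": b"eml://ecoinformatics.org/eml",
        "offset_range": (0, 1024),
    },
    {
        "mime": "application/xml",
        "glob": "*.xml",
        "magic": None,
        "offset_range": None,
    },
]


def match_glob(filename, pattern):
    """Glob-style detection: looks only at the file name."""
    return fnmatch.fnmatch(filename.lower(), pattern)


def match_magic(data, signature, offset_range):
    """Magic-style detection: the signature must start at an offset
    that lies inside offset_range (inclusive) of the content."""
    start, end = offset_range
    window = data[start:end + len(signature)]
    return signature in window


def detect(filename, data):
    """Return (mime, reason). Content evidence is tried before the name."""
    for rule in RULES:
        if rule["magic"] and match_magic(data, rule["magic"], rule["offset_range"]):
            return rule["mime"], "magic"
    for rule in RULES:
        if rule["glob"] and match_glob(filename, rule["glob"]):
            return rule["mime"], "glob"
    return "application/octet-stream", "default"
```

In this sketch, two files that both end in .xml can resolve to different types, because the content check runs before the file name check.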
- Creating custom-mimetypes.xml and resolving the issue of overriding the default <glob> patterns for files with the .xml extension
We created the custom-mimetypes.xml file, which uses the magic tag to identify the DataONE file formats. We then packaged this XML file into a jar that can be included in the classpath of TikaCLI to identify the file formats. We hit a blocking issue: the default glob patterns for the file extension could not be overridden. The issue was resolved by removing the tag that checks for the leading “<?xml” declaration in the file.
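As a rough illustration of what a magic-based entry in such a file can look like, the Python sketch below generates a small mime-info document. The element and attribute names (`mime-info`, `mime-type`, `magic`, `match`, `priority`, `offset`) follow the Tika MIME definition format as we understand it. The MIME type name, the signature string, the priority, and the offset range are hypothetical values, not the ones from our actual file.

```python
import xml.etree.ElementTree as ET


def build_custom_mimetypes(mime_type, signature, offset="0:1024", priority="60"):
    """Build a mime-info document holding one magic-based type definition.

    All argument values used in the example call below are hypothetical.
    """
    root = ET.Element("mime-info")
    mime = ET.SubElement(root, "mime-type", {"type": mime_type})
    magic = ET.SubElement(mime, "magic", {"priority": priority})
    # One string match, searched for within the given offset range.
    ET.SubElement(
        magic,
        "match",
        {"value": signature, "type": "string", "offset": offset},
    )
    return ET.tostring(root, encoding="unicode")


# Example with assumed values: a made-up MIME type and a namespace-style
# signature searched for within the first kilobyte of the file.
xml_text = build_custom_mimetypes(
    "application/x-example-science+xml",
    "eml://ecoinformatics.org/eml",
)
print(xml_text)
```

Matching on a string inside an offset range, rather than at offset 0, means the rule does not depend on what the file begins with.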
- Creating custom magic files for the Apache2 server
The Apache2 server also uses a magic file to identify the MIME types of the files it serves. We are working on a custom magic file for the Apache web server that can identify the DataONE file formats.
Once the magic file for the Apache web server is complete, we will have three tools (the file command, Apache Tika, and the Apache web server) capable of identifying the DataONE file formats.
That’s all for now, see you all next week!
Have a great weekend!
Resource links: GitHub, Project Plan