{"id":3200,"date":"2018-06-15T22:24:00","date_gmt":"2018-06-15T22:24:00","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=3200"},"modified":"2019-05-24T17:47:34","modified_gmt":"2019-05-24T17:47:34","slug":"week-3-custom-mimetypesmagic-file-for-the-dataone-file-formats-for-identification-using-apache-tikaapache-web-server","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/extending-libmagic\/week-3-custom-mimetypesmagic-file-for-the-dataone-file-formats-for-identification-using-apache-tikaapache-web-server\/","title":{"rendered":"Week 3: Custom mimetypes\/magic file for the DataONE file formats for identification using Apache Tika\/Apache web server"},"content":{"rendered":"
Hi All,<\/p>\n
This post continues my earlier posts on Project 4: Extending Libmagic for Identification of Science Resources<\/a>. Last week we created the magic file for the file <\/strong>command, and the maintainers of its repository accepted and committed our changes to the library. This week we explored Apache Tika<\/a> in more depth, aiming to give it the same ability to identify the DataONE file<\/a> formats. We completed the tasks below and will set new goals for the internship in the coming weeks.<\/p>\n We downloaded the source code and jar files for the Tika application and compiled them with Maven in order to create the custom-mimetypes.xml file.<\/p>\n The Tika application uses an XML file, “tika-mimetypes.xml”, which describes the MIME types that Tika supports. The file supports several tags that carry the information used to identify file types, such as <glob> and <magic>. The <magic> tag works much like the magic numbers used by libmagic.<\/p>\n We created a custom-mimetypes.xml file that uses the <magic> tag to identify the DataONE file<\/a> formats. We then packaged this XML file into a jar that can be included in the classpath of TikaCLI to identify the file formats. We hit a blocking issue: the default glob patterns for file extensions can’t be overridden. The issue was resolved by removing the tag that checks for “<?xml” at the start of the file.<\/p>\n The apache2 server also uses a magic file to identify the MIME types of the files it serves. We are working on a custom magic file for the Apache web server that can identify the DataONE file formats.<\/p>\n Once the magic file for the Apache web server is complete, we will have three tools, i.e. 
the file command, Apache Tika, and the Apache web server, capable of identifying the DataONE file formats.<\/p>\n That’s all for now, see you all next week!<\/p>\n Have a great weekend!<\/p>\n Resource links: GitHub<\/a>, Project Plan<\/a><\/p>\n <\/p>\n","protected":false},"excerpt":{"rendered":" Hi All, This blog is in conjunction with my earlier blogs for the Project 4: Extending Libmagic for Identification of Science Resources. In the last week we were able to create the magic file for the file command and the repository admins of it also accepted and committed the changes in the Continue reading Week 3: Custom mimetypes\/magic file for the DataONE file formats for identification using Apache Tika\/Apache web server<\/span>\n
\n
\n
\n