This week I finished the coverage analyzer tool and ran tests on over 200 popular ontologies (namely, the SWEET ontologies)! Since the coverage tool is the main part of the project, and getting this data puts me over a week ahead of schedule, I’m quite pleased.
The coverage tool is simple and efficient; its technical aspects were discussed last week. So while there are more nuts and bolts I implemented and tested, I don’t imagine that’s as interesting as the results from the SWEET ontologies.
So what are these results? Well, first let me frame the experiment. I used the DataONE corpus, which has over 47,000 documents. However, due to the memory constraints of my 12GB machine, I only used 1,000 of them. These documents were normalized and, using a thesaurus, transformed into an ontology with equivalence and subclass relations (as discussed in prior weeks). I then wrote a simple script to analyze the coverage of each SWEET ontology (there are 221 of them) against this “corpus ontology.”
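To make the setup concrete, here is a minimal sketch of the kind of coverage check described above: score one ontology against the corpus ontology by asking what fraction of its terms appear there. The function name and the set-of-terms representation are my assumptions for illustration, not the actual tool’s internals (those were discussed last week).

```python
# Hypothetical sketch of a coverage score between two ontologies,
# each reduced to a set of normalized class/relation labels.
# This is NOT the real analyzer, just an illustration of the idea.

def coverage(ontology_terms: set[str], corpus_terms: set[str]) -> float:
    """Fraction of the ontology's terms also found in the corpus ontology."""
    if not ontology_terms:
        return 0.0
    return len(ontology_terms & corpus_terms) / len(ontology_terms)

# Toy example: one of three terms overlaps, so coverage is ~0.33.
sweet_like = {"ocean", "salinity", "temperature"}
corpus_like = {"temperature", "rainfall", "soil"}
print(round(coverage(sweet_like, corpus_like), 2))
```

Running each of the 221 SWEET ontologies through a function like this against the corpus ontology yields one score per ontology.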
The results were surprising: only two ontologies matched the corpus at all, and their scores were under 5%. That was somewhat shocking to me. Out of 221 ontologies, only two match ANYTHING in the corpus (or at least the limited part we checked). What this means is that the classes and relations in the SWEET ontologies are not, semantically speaking, in the same domain as the corpus.
While this is quite interesting, it also presents a problem: we would ideally like to try more interesting subdomain searching techniques, but those need a somewhat related ontology to work with. Thus, I am currently checking other ontologies and sketching a system that will identify subtopics within a corpus and select the documents that best represent those subdomains (to ascertain whether a given ontology might match only a specific subdomain rather than the domain as a whole).
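Since that system is still being framed, here is only a rough sketch of one possible shape for the document-selection step: given candidate subtopic term sets, rank documents by term overlap and keep the top few per subtopic. Every name here (`best_documents`, the dict-of-sets inputs) is a hypothetical design assumption, not the eventual implementation.

```python
# Hypothetical sketch: pick the k documents that best represent each
# candidate subtopic, measured by naive term overlap. The real system
# may identify subtopics and rank documents quite differently.

def best_documents(docs: dict[str, set[str]],
                   subtopics: dict[str, set[str]],
                   k: int = 1) -> dict[str, list[str]]:
    """For each subtopic, return the k document ids with the greatest overlap."""
    result = {}
    for name, terms in subtopics.items():
        ranked = sorted(docs, key=lambda d: len(docs[d] & terms), reverse=True)
        result[name] = ranked[:k]
    return result

# Toy example with two documents and one "marine" subtopic.
docs = {"d1": {"ocean", "salinity"}, "d2": {"soil", "rainfall"}}
subtopics = {"marine": {"ocean", "salinity", "temperature"}}
print(best_documents(docs, subtopics))
```

The selected documents could then be turned into per-subdomain corpus ontologies and scored with the same coverage check as before.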