So this week I finished the subtopic matching software and, after some testing, ran it. My goal was to answer the question: does coverage decrease significantly if "topically" unrelated documents are removed from the corpus?
Surprisingly, I found that the answer is yes. Although the SWEET ontologies are small and aim to be precise in scope (meaning each one tries to cover only a very narrow topic), the coverage scores for any given ontology decrease by roughly one fifth when non-topically related documents are removed.
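To make the experiment concrete, here is a minimal sketch of the comparison. This is not the actual project code; the coverage metric (fraction of ontology terms found anywhere in the corpus), the term list, and the documents are all invented for illustration. The point is that an "unrelated" document can still incidentally supply term matches, so filtering it out lowers coverage:

```python
import re

def coverage(ontology_terms, corpus_docs):
    """Fraction of ontology terms that appear in at least one document."""
    corpus_words = set()
    for doc in corpus_docs:
        corpus_words.update(re.findall(r"[a-z]+", doc.lower()))
    matched = [t for t in ontology_terms if t.lower() in corpus_words]
    return len(matched) / len(ontology_terms)

# Hypothetical ontology terms and corpus.
terms = ["glacier", "moraine", "albedo", "sediment", "aquifer"]
full_corpus = [
    "Glacier retreat alters sediment transport and albedo.",
    "Sediment cores reveal past glacier extent.",
    # Topically unrelated (finance), yet it happens to mention a term:
    "Market liquidity evaporated like water from an aquifer.",
]
topical_only = full_corpus[:2]  # drop the unrelated document

print(coverage(terms, full_corpus))   # 0.8
print(coverage(terms, topical_only))  # 0.6 -- coverage drops
```

In this toy setup, removing the off-topic document costs one match ("aquifer"), which mirrors the kind of drop observed in the real experiment.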
This means that the documents in the corpus treat topics one way, while the ontologies treat them another. In other words, the corpus and the ontologies each view "topics" differently, at least as reflected in word choice.
I also found, after various bug fixes, that the SWEET ontologies achieve the coverage shown in this PDF <allCoverage> for the DataONE corpus. While the class score is reasonably high, the subclass score is not. Even so, the low subclass score is a significant first step, since we are only using a synonym-based heuristic to determine subclasses.
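A synonym-based heuristic along these lines might look like the sketch below. Everything here is illustrative, not the project's actual implementation: the synonym table is made up, and the rule is simply that a subclass counts as covered if its label or any listed synonym appears in the corpus text:

```python
# Hypothetical synonym table; a real one might come from WordNet or
# the ontology's own annotations.
SYNONYMS = {
    "precipitation": ["rainfall", "rain", "snowfall"],
    "permafrost": ["frozen ground"],
}

def subclass_covered(label, corpus_text):
    """True if the subclass label or any known synonym appears in the corpus."""
    text = corpus_text.lower()
    candidates = [label] + SYNONYMS.get(label, [])
    return any(c in text for c in candidates)

corpus = "Annual rainfall and frozen ground extent were recorded at each site."

print(subclass_covered("precipitation", corpus))  # True, via synonym "rainfall"
print(subclass_covered("glacier", corpus))        # False, no label or synonym match
```

A heuristic like this is cheap but brittle: it misses paraphrases and morphological variants, which is one plausible reason a subclass score computed this way would come out low.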
All in all, the final results from this project show that while the SWEET ontologies match quite a bit of the DataONE corpus, they are still far from a perfect match. More work is needed to refine an ontology that accurately reflects the entirety of the DataONE corpus.
Also, future work might investigate how the topics within ontologies differ from the topics within scientific documents and articles. Understanding this difference should improve our ability to build more "domain-specific" ontologies.