This week my main goal was to get a Part of Speech (PoS) tagger up and running. After some searching and testing I decided to use the Natural Language Toolkit (NLTK.org). Although it has to be installed (as opposed to running from a jar or Python egg), it runs quickly and effectively and provides a huge range of options. It is not perfect, but it works smoothly and more than satisfies our needs.
However, because I was using an existing library, this didn’t take nearly the whole week. So I moved on to next week's goal, which was to generate an ontology to represent the corpus using thesauri and the words identified by the PoS tagger. This was slightly complicated, as the current state-of-the-art approach only enables equivalence associations.
To improve on this I came up with a fairly novel approach that enables the inference of subclasses based on synonym symmetry. That is, if you have two words, let's say cow and mammal, we can infer that cow is a subclass of mammal with the following process. Use the thesauri to look up all synonyms for cow and for mammal. We would find something like the following: cow -> steer, mammal, bessie, milk, etc.; mammal -> animal, classification, etc. Thus, we would see that while mammal is a synonym for cow, cow is NOT a synonym for mammal. What does that mean? Simply that ALL cows are mammals, but NOT ALL mammals are cows. In other words, cow is a subclass of mammal.
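To make the synonym-symmetry idea concrete, here is a minimal sketch in Python. It is not the project's actual code: the `THESAURUS` dictionary is toy data (the "steer" entry is made up for illustration), and the function names `infer_relation` and `build_ontology` are hypothetical. It assumes the thesaurus can be reduced to a mapping from each word to the set of synonyms listed for it.

```python
from itertools import combinations

# Toy thesaurus (illustrative data only): word -> set of listed synonyms.
THESAURUS = {
    "cow": {"steer", "mammal", "bessie", "milk"},
    "mammal": {"animal", "classification"},
    "steer": {"cow"},  # made-up entry to show a symmetric pair
}


def infer_relation(a, b, thesaurus):
    """Return how word `a` relates to word `b`, or None if unrelated.

    "equivalent": each word lists the other as a synonym.
    "subclass":   only `a` lists `b`, so `a` is taken as a subclass of `b`.
    "superclass": only `b` lists `a`, so `b` is taken as a subclass of `a`.
    """
    a_lists_b = b in thesaurus.get(a, set())
    b_lists_a = a in thesaurus.get(b, set())
    if a_lists_b and b_lists_a:
        return "equivalent"
    if a_lists_b:
        return "subclass"
    if b_lists_a:
        return "superclass"
    return None


def build_ontology(words, thesaurus):
    """Collect (word, relation, word) edges for every related pair of words."""
    edges = []
    for a, b in combinations(sorted(words), 2):
        relation = infer_relation(a, b, thesaurus)
        if relation == "superclass":
            # Normalise so that subclass edges always read child -> parent.
            edges.append((b, "subclass", a))
        elif relation is not None:
            edges.append((a, relation, b))
    return edges
```

With the toy data above, `infer_relation("cow", "mammal", THESAURUS)` gives "subclass", while the symmetric pair cow/steer comes out as "equivalent". A real thesaurus would also need some handling of word senses, since a one-way synonym listing does not always indicate a true subclass.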
While this approach is not perfect, and introduces some possibly weird associations (as not all synonyms make sense within a given context), initial testing suggests that it can greatly increase the ability of our technique to determine coverage. With this, I have most of the code completed to generate an ontology using a thesaurus lookup algorithm. I also have some informal test cases to go with it.