At the end of last week I was trying to find a related ontology to enable subtopic matching. However, after some failed attempts, it became clear that, until we can build an ontology from the whole corpus rather than just a part of it (recall that a few weeks ago I discussed memory overload on my performance-limited local machine), we should move on.

Thus, this week I spent time working on the subtopic generator. The point of this is to answer the question: is a given ontology mostly related to a specific subtopic of the corpus, or is its similarity cross-cutting?

This process is fairly simple in concept. Using LDA (Latent Dirichlet Allocation) you can (non-deterministically) calculate a word-probability distribution for a set number of topics. For example, if you tell the algorithm you want 2 topics, it might return results like this:
Topic 1: {dog:0.05, puppy:0.08, canine:0.10, pet:0.25}
Topic 2: {food:0.08, eat:0.02, drink:0.09, water:0.12}

You would read this as: for topic 1, the probability that the word "dog" appears whenever this topic is being discussed is 5%. Likewise, the probability that the word "water" is used when topic 2 is being discussed is 12%.

Then, you can use cosine similarity to determine which documents from the corpus are the most associated with each topic. Basically, you transform each word-probability distribution into a vector and compare the similarity between the topic vectors and the document vectors from the corpus.
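As a sketch of that comparison, here is a pure-Python cosine similarity between a {word: probability} topic and normalised document word frequencies. The topic numbers come from the toy example above; the documents are invented for illustration.

```python
# Sketch: scoring how strongly each document is associated with a topic
# via cosine similarity over a shared vocabulary.
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two {word: weight} vectors."""
    words = set(a) | set(b)
    dot = sum(a.get(w, 0.0) * b.get(w, 0.0) for w in words)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def doc_vector(text):
    """Represent a document as normalised word frequencies."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

topic = {"dog": 0.05, "puppy": 0.08, "canine": 0.10, "pet": 0.25}

docs = [
    "my pet dog is a good dog",
    "drink water eat food",
]
scores = [cosine(topic, doc_vector(d)) for d in docs]
print(scores)  # the pet/dog document should score higher
```

A real run would compute these scores for every document against every topic and keep the documents whose score is highest for a given topic.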

This process is now complete and coded up: you can generate the subtopics and find the associated documents from the corpus (and, using previous code, each sub-corpus can then be tested for coverage against a given ontology). That was this week's work.

However, the difficult part is that we need to know how many topics to use; LDA requires the user to select the number of topics before it can start. This is complicated because our corpus is over 30MB as a plain text file, making a manual investigation out of the question. Thus, I am starting the process of trying lots and LOTS of different topic quantities (telling the algorithm to try 10, 15, 20, …, 25000, 25500, …) and computing a "uniqueness measure" for each.
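A rough sketch of such a sweep, with the mean pairwise cosine similarity between topic distributions standing in for the uniqueness measure — both the scikit-learn usage and that choice of statistic are assumptions, not the actual code:

```python
# Sketch: fit LDA for several topic counts k and record a similarity
# signal for each (lower mean pairwise similarity = more unique topics).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "dog puppy canine pet household",
    "food eat drink water meal",
    "pet dog puppy household animal",
    "water drink food eat snack",
]
counts = CountVectorizer().fit_transform(docs)

def mean_pairwise_similarity(topic_word):
    """Mean cosine similarity over all pairs of topic rows."""
    norm = topic_word / np.linalg.norm(topic_word, axis=1, keepdims=True)
    sims = norm @ norm.T
    k = len(sims)
    # average the off-diagonal entries only
    return (sims.sum() - k) / (k * (k - 1))

results = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(counts)
    probs = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    results[k] = mean_pairwise_similarity(probs)
print(results)
```

In the real sweep the candidate values of k are much larger and the corpus is the full 30MB text, so each fit is substantially more expensive.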

This uniqueness measure basically calculates how similar a given topic is to ALL other topics generated: a low score means a topic is dissimilar from the rest, and therefore unique. For example, the score for the two topics in the example above would be 0. They share no words, so they are completely dissimilar and thus completely unique. However, consider the following example:
Topic 1: {kitten:0.05, puppy:0.07, pet:0.09, household:0.11}
Topic 2: {cat:0.05, puppy:0.08, pet:0.11, household:0.02}

These topics have many words in common, and with similar probabilities, meaning that their score would be high, around 0.9. This tells us they have a lot in common and are thus not very unique. I would then assert that this corpus didn't have 2 topics; it had only a single topic, and LDA tried to find two because I asked it to. However, it could not.
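Treating the uniqueness check as a cosine similarity between topic distributions (an assumption — the exact measure in use may differ), both examples can be reproduced in a few lines:

```python
# Sketch: cosine similarity between two {word: probability} topics,
# where 0 means completely unique and values near 1 mean near-duplicates.
from math import sqrt

def topic_similarity(a, b):
    """Cosine similarity between two topic word distributions."""
    dot = sum(a.get(w, 0.0) * b.get(w, 0.0) for w in set(a) | set(b))
    return dot / (sqrt(sum(v * v for v in a.values())) *
                  sqrt(sum(v * v for v in b.values())))

# Disjoint topics from the first example -- completely unique.
animals = {"dog": 0.05, "puppy": 0.08, "canine": 0.10, "pet": 0.25}
meals = {"food": 0.08, "eat": 0.02, "drink": 0.09, "water": 0.12}

# Overlapping topics from the second example -- largely redundant.
pets_a = {"kitten": 0.05, "puppy": 0.07, "pet": 0.09, "household": 0.11}
pets_b = {"cat": 0.05, "puppy": 0.08, "pet": 0.11, "household": 0.02}

print(topic_similarity(animals, meals))  # 0.0: no shared words
print(topic_similarity(pets_a, pets_b))  # high: mostly the same topic
```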

Thus, for each of these topic counts I am examining the uniqueness scores to determine an appropriate number of topics for our corpus. However, the number alone is insufficient; it only tells us how unique the topics are. Consider this.

Topic 1: {dog:0.07, cat:0.08, pet:0.10}
Topic 2: {puppy: 0.07, kitten:0.08, domesticated animal:0.10}

These two topics have NO words in common, giving them a perfect uniqueness score. However, we would all look at the words in each topic and see that they are in fact largely similar. In fact, I would argue that they should probably be the same topic: dogs and puppies are similar, as are cats and kittens, and pets and domesticated animals. They are likely the same topic, even if the words are different.

Thus, after looking at the scores, it's important to take a representative sample and manually examine the topics to determine their quality. This manual examination is my goal for next week.
