Normalizing data

I spent most of this week normalizing my two data sets (the corpus and the ontology). For the corpus, the normalization process was fairly straightforward and involved a few steps (a rough code sketch follows the list):
1) remove punctuation
2) force lower case
3) remove stop words
4) stem each word
5) remove “number words”
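
Below is a minimal sketch of how these five steps might chain together. The function names, the stop-word list, and the default (no-op) stemmer are illustrative stand-ins of mine, not the actual scripts:

```python
import re

# Illustrative stop-word list; the real scripts presumably use a much fuller one.
STOP_WORDS = {"i", "am", "having", "the", "with", "my", "a", "an", "of"}


def remove_punctuation(text):
    # Step 1: replace anything that is not a letter, digit, or whitespace.
    return re.sub(r"[^A-Za-z0-9\s]", " ", text)


def is_number_word(token):
    # Step 5: a "number word" is a token made up entirely of digits.
    return token.isdigit()


def normalize(text, stem=lambda w: w):
    # Steps 1-5 in order: strip punctuation, force lower case, drop stop
    # words, stem each remaining word, then drop number words.
    tokens = remove_punctuation(text).lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [stem(t) for t in tokens]
    return [t for t in tokens if not is_number_word(t)]
```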

Scripts from last year already handled most of the first three steps: removing punctuation, forcing lower case, and removing stop words.

For stemming, I use an existing library (the PyPI stemming package) that provides four stemmers: Lovins, Paice, Porter, and Snowball. These range from heavy-weight stemmers (Lovins and Paice) to lighter, rule-based stemmers (Porter and Snowball). The code is written so that the stemmer can be chosen. Lastly, a “number word” is a token made up entirely of digits (e.g., 1827 or 832489). While these might be useful to a person, they add no meaningful value in this type of evaluation, so they are removed. I also included formal test cases.
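
As a sketch of the stemmer selection, the mapping below assumes the PyPI stemming package exposes lovins, paicehusk, porter, and porter2 modules, each with a stem() function, and that “paice” corresponds to paicehusk and “snowball” to porter2 (the package's Snowball-style English stemmer); that layout is my assumption, not something stated above:

```python
from stemming import lovins, paicehusk, porter, porter2

# Assumed module layout of the PyPI "stemming" package; "paice" is taken
# to mean the Paice/Husk stemmer and "snowball" the Porter2 stemmer.
STEMMERS = {
    "lovins": lovins.stem,
    "paice": paicehusk.stem,
    "porter": porter.stem,
    "snowball": porter2.stem,
}


def get_stemmer(name):
    # Look up a stemmer by name so the caller can choose which to apply,
    # e.g. normalize(text, stem=get_stemmer("snowball")).
    return STEMMERS[name]
```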

The process for the ontology was slightly trickier, since most of the text in the ontology file is metadata (the ontology is stored as XML). Here I used regular expressions to isolate the features of the ontology to be normalized (e.g., the classes). Each normalized ontology is then saved in a separate directory, which allows comparison between the original and the stemmed version. I then wrote formal test cases to make sure the code meets expectations.
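
As a rough illustration of the regular-expression approach, the sketch below pulls the text of rdfs:label elements out of an OWL/RDF-XML file and rewrites each one in normalized form. The element name, the label-based matching, and the normalize_fn hook are all hypothetical choices of mine, not the actual implementation:

```python
import re

# Hypothetical pattern: capture the opening tag, text content, and closing
# tag of every rdfs:label element in the ontology file.
LABEL_PATTERN = re.compile(r"(<rdfs:label[^>]*>)(.*?)(</rdfs:label>)", re.DOTALL)


def extract_labels(xml_text):
    # Isolate the human-readable features to be normalized, leaving the
    # surrounding XML metadata alone.
    return [m.group(2) for m in LABEL_PATTERN.finditer(xml_text)]


def normalize_ontology(xml_text, normalize_fn):
    # Replace each label's text with its normalized form (normalize_fn is a
    # callable such as the normalize() sketch above), so the result can be
    # written to a separate directory and compared with the original file.
    def _rewrite(match):
        open_tag, text, close_tag = match.groups()
        normalized = " ".join(normalize_fn(text))
        return open_tag + normalized + close_tag

    return LABEL_PATTERN.sub(_rewrite, xml_text)
```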

An example:
INPUT: “I am having the…arrgg! Just the hardest day, with, my new vegetable champion 7000.”
OUTPUT: arrgg just hard day new vegetable champion

This will allow for better text matching in the future.
