{"id":1218,"date":"2013-06-07T16:29:41","date_gmt":"2013-06-07T16:29:41","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=1218"},"modified":"2013-06-07T16:29:41","modified_gmt":"2013-06-07T16:29:41","slug":"normalizing-data","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/ontology-coverage\/normalizing-data\/","title":{"rendered":"Normalizing data"},"content":{"rendered":"

This week I spent most of my time normalizing my two data sets (my corpus and my ontology). For the corpus, the normalization process was fairly straightforward and involved a few steps (a sketch of the full pipeline appears after the stemming discussion below):

1) remove punctuation
2) force lower case
3) remove stop words
4) stem each word
5) remove “number words”

Scripts from last year already handled most of the first three steps: removing punctuation, forcing lower case, and removing stop words.

To stem, I use an existing library (the stemming package on PyPI) that provides four stemming algorithms: lovins, paice, porter, and snowball. These range from heavy-weight stemmers (lovins and paice) to lighter rule-based stemmers (porter and snowball). The code is written so that the stemmer can be chosen. Lastly, a “number word” is a token that contains no non-number characters (e.g., 1827 or 832489). While such tokens might be useful to a person, they provide no meaningful value in this type of evaluation. I also wrote formal test cases for this code.
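To make the pipeline concrete, here is a minimal sketch of the five steps, assuming the PyPI stemming package. The stop-word list and the stemmer registry are illustrative placeholders, not my actual scripts.

```python
# Minimal sketch of the corpus-normalization pipeline (steps 1-5).
# Assumes the PyPI "stemming" package; the stop-word list is a toy
# placeholder, not the real list my scripts use.
import re

from stemming import lovins, paicehusk, porter, porter2

# The four algorithms the library provides ("snowball" is porter2).
STEMMERS = {
    "lovins": lovins.stem,
    "paice": paicehusk.stem,
    "porter": porter.stem,
    "snowball": porter2.stem,
}

STOP_WORDS = {"i", "am", "having", "the", "with", "my"}  # illustrative only


def is_number_word(token):
    """True for tokens with no non-number characters (e.g., 1827)."""
    return token.isdigit()


def normalize(text, stemmer="porter"):
    stem = STEMMERS[stemmer]
    # 1) remove punctuation and 2) force lower case
    text = re.sub(r"[^\w\s]", " ", text).lower()
    # 3) remove stop words and 5) remove "number words"
    tokens = [t for t in text.split()
              if t not in STOP_WORDS and not is_number_word(t)]
    # 4) stem each surviving word
    return " ".join(stem(t) for t in tokens)


print(normalize("I am having the hardest day, with, my champion 7000."))
```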

The process for the ontology was slightly trickier because most of the text within the ontology file is metadata (the ontology is stored as XML). Here I used regular expressions to isolate the features of the ontology to be normalized (e.g., the classes). Each normalized ontology is then saved in a separate directory, which allows comparison between the original and the stemmed version. I then wrote formal test cases to ensure the code meets expectations.
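As an illustration of that isolation step, here is a hedged sketch. The pattern assumes classes are declared as owl:Class elements with rdf:about attributes; my actual regular expressions and ontology serialization may differ.

```python
# Sketch: isolating ontology class names from OWL/XML with a regular
# expression. Assumes <owl:Class rdf:about="...#Name"/> declarations;
# real ontologies may declare classes differently.
import re

CLASS_PATTERN = re.compile(r'<owl:Class[^>]*rdf:about="[^"]*#([^"]+)"')


def extract_class_names(owl_xml):
    """Return the fragment identifiers of classes declared in owl_xml."""
    return CLASS_PATTERN.findall(owl_xml)


sample = '<owl:Class rdf:about="http://example.org/onto#VegetableChampion"/>'
print(extract_class_names(sample))  # ['VegetableChampion']
```

The extracted names would then be run through the same normalization routine used for the corpus.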

An example:

INPUT: “I am having the…arrgg! Just the hardest day, with, my new vegetable champion 7000.”
OUTPUT: arrgg just hard day new vegetable champion

This will allow for better text matching in the future.
