{"id":3499,"date":"2019-06-24T15:23:00","date_gmt":"2019-06-24T15:23:00","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=3499"},"modified":"2019-06-24T15:27:51","modified_gmt":"2019-06-24T15:27:51","slug":"week-5-association-rules-and-midterm-evaluation","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/prov-self\/week-5-association-rules-and-midterm-evaluation\/","title":{"rendered":"Week 5 – Association Rules and Midterm Evaluation"},"content":{"rendered":"\n

Hello World! <\/p>\n\n\n\n

This week was filled with coding and debugging. Based on our previous research and a discussion with Bertram, we decided to focus on the Galaxy Zotero Group and run hands-on experiments. Below is the process we followed toward our first goal: exploring the distribution of tags.<\/p>\n\n\n\n

Data Collection<\/strong><\/p>\n\n\n\n

Getting the data was not easy for me, because the Galaxy Group website is built on a somewhat unusual JavaScript infrastructure, which makes it hard for a script to locate the real data source. But finally, I got it! Using the Chrome Inspector, I found the API behind the page and retrieved the data I wanted. Meanwhile, we still need to email the Group to ask for permission to use the data for research.<\/p>\n\n\n\n
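
As a rough illustration of paging through such an API, here is a minimal sketch. It assumes the public Zotero Web API v3 group-items endpoint, which may not be the endpoint that was actually found in the Chrome Inspector. GROUP_ID is a placeholder and page_urls is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: Zotero Web API v3 endpoint for the items of a group library.
# The post does not say which API was actually used.
BASE = 'https://api.zotero.org/groups/{group_id}/items'


def page_urls(group_id, total, limit=100):
    # Build one URL per page; the Zotero API returns at most 100 items per request.
    urls = []
    for start in range(0, total, limit):
        query = urlencode({'format': 'json', 'limit': limit, 'start': start})
        urls.append(BASE.format(group_id=group_id) + '?' + query)
    return urls


# Placeholder id, not the real Galaxy group id.
GROUP_ID = 123456
urls = page_urls(GROUP_ID, total=7754)
print(len(urls), 'pages, first:', urls[0])
```

Each URL could then be fetched with urllib.request.urlopen and parsed with json.load, ideally with a short pause between requests.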

Data Summary: the dataset contains 7754 rows, and each row represents a paper related to the Galaxy Project. There are 26 attributes in the dataset, including paper title, paper type (journal\/conference\u2026), creators (authors), abstract, DOI, tags, etc. <\/em><\/p>\n\n\n\n

Data Cleaning<\/strong><\/p>\n\n\n\n

After collecting the raw data, the next step is data cleaning. Since our goal is to explore the distribution of tags and the paper content, the columns \u201cpaper_title\u201d, \u201cDOI\u201d and \u201ctags\u201d are extracted from the raw data. <\/p>\n\n\n\n
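
The extraction step can be sketched as follows. The column names come from the post, but the sample rows, the clean helper and the choice to deduplicate tags are illustrative assumptions, not the actual cleaning script.

```python
# Hypothetical raw rows; the real rows carry 26 attributes.
raw_rows = [
    {'paper_title': 'Paper A', 'DOI': '10.0000/a',
     'tags': ['+UseMain', '+Methods', '+Methods'], 'abstract': '...'},
    {'paper_title': 'Paper B', 'DOI': None, 'tags': None, 'abstract': '...'},
]

# The three columns named in the post.
KEEP = ('paper_title', 'DOI', 'tags')


def clean(rows):
    # Keep only the wanted columns and normalise tags to a sorted, duplicate-free list.
    cleaned = []
    for row in rows:
        slim = {key: row.get(key) for key in KEEP}
        slim['tags'] = sorted(set(slim['tags'] or []))
        cleaned.append(slim)
    return cleaned


cleaned = clean(raw_rows)
for row in cleaned:
    print(row)
```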

\"\"<\/figure>\n\n\n\n

Explore \u201ctags\u201d <\/strong><\/p>\n\n\n\n

Summary<\/strong> The \u201ctags\u201d column is especially interesting to us because it contains keywords related to provenance research, for example \u201creproducibility\u201d. The project objective is to find out how provenance tools are currently used in academia, and this column is a good place to start. Furthermore, \u201ctags\u201d contains both manually added tags and tags generated automatically by Zotero, which makes it worth a deeper look. <\/p>\n\n\n\n

Data Transform <\/strong>The \u201ctags\u201d column can be treated as categorical data. For further analysis, this column is transformed into a matrix in which each tag becomes a column of the new dataset. The value of each cell shows whether the paper has this tag (value = 1) or not (value = 0). <\/p>\n\n\n\n
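
The transformation described above can be sketched in plain Python. The paper titles and the one_hot helper are made up for illustration. The tags are ones mentioned in this post.

```python
# Hypothetical sample: paper title -> list of tags.
papers = {
    'Paper A': ['+Methods', '+UseMain'],
    'Paper B': ['+Methods', '+UseLocal'],
    'Paper C': ['reproducibility'],
}


def one_hot(tag_lists):
    # One column per distinct tag, in sorted order.
    columns = sorted({tag for tags in tag_lists.values() for tag in tags})
    # One row per paper: 1 if the paper has the tag, 0 otherwise.
    matrix = {
        title: [1 if tag in tags else 0 for tag in columns]
        for title, tags in tag_lists.items()
    }
    return columns, matrix


columns, matrix = one_hot(papers)
print(columns)
for title, row in matrix.items():
    print(title, row)
```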

\"\"<\/figure>\n\n\n\n

Tags Visualization and Association Rules<\/strong> With the transformed data matrix, it is easy to generate a histogram that shows the distribution of each tag. <\/p>\n\n\n\n
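
As a small stand-in for the plot, the per-tag counts behind such a histogram can be computed and printed as a text bar chart. The sample tag lists are invented, and a plotting library would be used for a real figure.

```python
from collections import Counter

# Hypothetical tag lists, one per paper.
tag_lists = [
    ['+Methods', '+UseMain'],
    ['+Methods', '+UseLocal'],
    ['+Methods', '+UsePublic'],
    ['reproducibility'],
]

# Count in how many papers each tag appears (set() guards against repeated tags).
counts = Counter(tag for tags in tag_lists for tag in set(tags))

# Text bar chart, most frequent tag first.
for tag, n in counts.most_common():
    bar = '#' * n
    print(f'{tag:<16} {bar} {n}')
```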

\"\"<\/figure>\n\n\n\n

Based on our earlier discussion, finding association rules between tags became our next goal, and the result is shown below.\n<\/p>\n\n\n\n

\"\"<\/figure>\n\n\n\n

We can see strong associations between the tags \u201c+UseLocal\u201d and \u201c+Methods\u201d, \u201c+UseMain\u201d and \u201c+Methods\u201d, and \u201c+UsePublic\u201d and \u201c+Methods\u201d. However, these relationships do not yet tell us anything further on their own, so more processing and other algorithms will be needed to dig out useful information. <\/p>\n\n\n\n
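
For readers unfamiliar with association rules, here is a minimal sketch of how pairwise rules of the form A -> B can be scored by support, confidence and lift. It is a brute-force simplification over tag pairs, not necessarily the algorithm behind the figure above. The transactions and thresholds are invented.

```python
from itertools import permutations

# Hypothetical transactions: the set of tags on each paper.
transactions = [
    {'+Methods', '+UseMain'},
    {'+Methods', '+UseLocal'},
    {'+Methods', '+UsePublic'},
    {'+Methods', '+UseMain'},
    {'reproducibility'},
]


def pair_rules(transactions, min_support=0.2, min_confidence=0.8):
    n = len(transactions)
    tags = sorted(set().union(*transactions))

    def support(items):
        # Fraction of papers that carry every tag in items.
        return sum(1 for t in transactions if items <= t) / n

    rules = []
    for a, b in permutations(tags, 2):
        s_ab = support({a, b})
        if s_ab < min_support:
            continue
        confidence = s_ab / support({a})  # estimate of P(b | a)
        if confidence >= min_confidence:
            lift = confidence / support({b})  # above 1 means b is more likely given a
            rules.append((a, b, s_ab, confidence, lift))
    return rules


rules = pair_rules(transactions)
for a, b, s, c, lift in rules:
    print(f'{a} -> {b}: support={s:.2f} confidence={c:.2f} lift={lift:.2f}')
```

In this toy data every rule points at +Methods, partly because +Methods is on most papers. Lift helps separate that effect from a real dependency between two tags.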

Hope you all have a good weekend. <\/p>\n","protected":false},"excerpt":{"rendered":"

Hello World! This week is a week filled with code and debug. Based on previous research, we decided to focus on the Galaxy Zotero Group and do hands-on experiments\u00a0after discussing with Bertram. Following is the process to achieve our first goal: explore tags distribution. Data Collection The way to get Continue reading Week 5 – Association Rules and Midterm Evaluation<\/span>→<\/span><\/a><\/p>\n","protected":false},"author":124,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[391],"tags":[],"_links":{"self":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/3499"}],"collection":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/users\/124"}],"replies":[{"embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/comments?post=3499"}],"version-history":[{"count":2,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/3499\/revisions"}],"predecessor-version":[{"id":3505,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/3499\/revisions\/3505"}],"wp:attachment":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/media?parent=3499"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/categories?post=3499"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/tags?post=3499"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}