{"id":1962,"date":"2014-02-18T17:55:06","date_gmt":"2014-02-18T17:55:06","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=1962"},"modified":"2014-02-18T17:56:44","modified_gmt":"2014-02-18T17:56:44","slug":"continue-scraping-introduce-quality-control-with-hashes","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/data-science\/continue-scraping-introduce-quality-control-with-hashes\/","title":{"rendered":"Continue Scraping, Introduce Quality Control with Hashes"},"content":{"rendered":"
Continuation and completion of harvesting with quality control / assurance exploration using hashes and checksum software.
Start 97 – 77

97 contains year 3 and offset 450.
Start at 12:05.
Save text file Topsy-97-77. End at 12:21.
New file Topsy-76-56. 56 ends at Y3040.
Expand to include next 3 to 010.
Rename file to 53.

Topsy-76-53

Remaining to process: 52 through 21. 52 should start out Y2 data.

Remember to start from here: Create new file: Topsy-52-21
*Note: this starts out as year 2 data. Stops at Y2140.

Need to check naming to see if I missed anything. Also, how is 1 – 20 stored?

http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=70&mintime=1280664024&maxtime=1312113651

No tweets found. Why? Apparently URL 7 exceeded the range of dates for Year 1, so no tweets were found. URL 6 was the last in Y1 with Twitter data; URL 8 starts year 2. So, file Y1070 should have no data. Rather than delete the file, I'm going to populate it with -9999.

I'm now satisfied that I have all the tweets. Since I did not use a program to do this, I am not satisfied that I did not introduce human error. I want to download all of the files created in my personal Google Drive as .csv files to my work computer. I then want to see if there is a quality control program I can use (such as hashes) to determine if any of the files are accidental duplicates (the text is converted to numbers, which are combined into a value that should be unique to the content; non-unique values indicate duplicate files).

There are 59 files. Moved them to a Google Drive folder called "DataONE-Topsy" and added public visibility to the folder.

https://drive.google.com/folderview?id=0B_9TV1q9zxYuc3UzTGVmS2ZFUUU&usp=sharing

In the download interface, I am informed there are 147 spreadsheets to download.

I changed my Chrome download settings to save to:
C:\Users\tjessel\Documents\DataONE Research\Twitter Data for DataONEorg\Topsy-Data

The resulting zipped file is 462 KB. The name of the file is "documents-export-2014-02-18.zip". Extract all to folder "C:\Users\tjessel\Documents\DataONE Research\Twitter Data for DataONEorg\Topsy-Data\documents-export-2014-02-18". Almost all of the files are 4 KB, so nothing really obvious there.

Now, software for processing?

Searched Google for "hashing software" without quotes and found this:
http://quickhash.sourceforge.net/

A Google search for "hash files for quality control" without quotes brought this to my attention:
http://www.turbosfv.com/Download

TurboSFV looks easy to use to me (and has a 30-day free trial). Assuming I have a 64-bit machine – TurboSFV x64 (64-bit). File went here:
C:\Users\tjessel\Downloads\TurboSFV_PE_x64_5.zip

I apparently do not have a 64-bit machine! Try the 32-bit version: TurboSFV x86 (32-bit).

From the TurboSFV site: "With TurboSFV, you can create checksums for files, folders, and drives."

I followed the demo and created a SHA-224 checksum file with the extension .sh2. The resultant file is 16 KB and the name is "Y1010". Using Alt + PrintScreen I took a screen capture. See uploaded file.

147 files were processed, and I had 147 spreadsheets. Now I think I need to validate them to see if any match.
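Rather than validating by eye alone, a short script could recompute the digests and group matching files for me. The following is only a minimal sketch, assuming Python and its standard hashlib module; the extraction folder path and the choice of SHA-224 come from the notes above, while everything else (variable names, output format) is illustrative.

    # Sketch: hash every exported file with SHA-224 and group files sharing a digest.
    import hashlib
    import os
    from collections import defaultdict

    # Assumed extraction folder from above; adjust as needed.
    folder = (r"C:\Users\tjessel\Documents\DataONE Research\Twitter Data for DataONEorg"
              r"\Topsy-Data\documents-export-2014-02-18")

    digests = defaultdict(list)
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            with open(path, "rb") as fh:
                # The digest depends only on the file's bytes, not on its name.
                digests[hashlib.sha224(fh.read()).hexdigest()].append(name)

    for digest, names in sorted(digests.items()):
        if len(names) > 1:
            print(digest, "->", ", ".join(names))  # candidate duplicates

Run over the extracted folder, this should list the same groups of matching files that TurboSFV flags, which would be a useful cross-check on the tool's output.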
The total size of the files is 525,088 bytes. The maximum size is 3,835 bytes. The minimum size is 2,958 bytes.

I'm looking for any duplicates. I don't see an output where I can run "deduplicate" in a spreadsheet, so visual inspection will have to do.

Several files have identical checksums:

Y4060; Y3060
Y3140; Y3020
Y3110; Y2140
Y3500; Y3310
Y3350; Y1020
Y3550; Y3520; Y3460
Y4320; Y4140
Y3250; Y2380; Y2310
Y2290; Y2230

There are a lot, and I want to spot check them first to see if this is an accurate way of gauging whether files have identical content. Originally I was going to look at the hashes, but I realized that if I named the files different things, then the total size would be different even if the data was the same.

Several files have identical checksums – all were "OK."

However, no files should have identical hashes – correct? One way to test might be to run it again and see if the numbers match. I don't think that will work.

We went over the idea of checksums in the Environmental Management Institute – I may need to check with someone (pun not intended but enjoyed) on the concept of checksums to make sure I'm looking at this correctly.

Could also run on two files with identical content to test out the tool. Definitely worth looking at for validation of many files.
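Before checking with anyone, one way to test the concept myself is sketched below. It again assumes Python's hashlib, and the three file names are made up purely for the test: two files with identical content but different names should produce the same SHA-224 digest, while a file with different content should not, which would confirm that the checksum reflects only a file's contents and is unaffected by renaming.

    # Sketch: identical content should give identical SHA-224 digests, regardless of name.
    import hashlib

    def sha224_of(path):
        with open(path, "rb") as fh:
            return hashlib.sha224(fh.read()).hexdigest()

    # Hypothetical test files, created only for this check.
    with open("copy_a.csv", "w") as fh:
        fh.write("tweet_id,text\n1,hello\n")
    with open("copy_b.csv", "w") as fh:          # same content, different name
        fh.write("tweet_id,text\n1,hello\n")
    with open("different.csv", "w") as fh:       # different content
        fh.write("tweet_id,text\n2,goodbye\n")

    print(sha224_of("copy_a.csv") == sha224_of("copy_b.csv"))     # expected: True
    print(sha224_of("copy_a.csv") == sha224_of("different.csv"))  # expected: False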