{"id":1962,"date":"2014-02-18T17:55:06","date_gmt":"2014-02-18T17:55:06","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=1962"},"modified":"2014-02-18T17:56:44","modified_gmt":"2014-02-18T17:56:44","slug":"continue-scraping-introduce-quality-control-with-hashes","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/data-science\/continue-scraping-introduce-quality-control-with-hashes\/","title":{"rendered":"Continue Scraping, Introduce Quality Control with Hashes"},"content":{"rendered":"

Continuation and completion of harvesting, with quality control / assurance exploration using hashes and checksum software.


Start 97 – 77

97 contains year 3 and offset 450

Start at 12:05

Save text file
Topsy-97-77

End at 12:21

New File
Topsy-76-56

56 ends at Y3040

Expand to include next 3 to 010

Rename file to 53

Topsy-76-53

Remaining to process:

52 through 21

52 should start out with Y2 data.

Remember to start from here:
https://sites.google.com/site/mountainsol/

Create new file: Topsy-52-21

*Note: this starts out with year 2 data. Stops at Y2140.

Need to check naming to see if I missed anything.

Also, how is 1 – 20 stored?

http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=70&mintime=1280664024&maxtime=1312113651

No tweets found. Why?

Apparently URL 7 exceeded the range of dates for Year 1, so no tweets were found.

URL 6 was the last in Y1 with Twitter data; URL 8 starts Year 2.
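As a quick sanity check on that explanation, the mintime and maxtime values in the URL above convert to dates that bound Year 1. A minimal sketch (the two timestamps are copied from the URL; everything else is illustrative):

```python
from datetime import datetime, timezone

# Unix timestamps copied from the Topsy URL above (the Year 1 window)
mintime = 1280664024
maxtime = 1312113651

for label, ts in (("mintime", mintime), ("maxtime", maxtime)):
    print(label, datetime.fromtimestamp(ts, tz=timezone.utc).date())
# mintime 2010-08-01, maxtime 2011-07-31: a one-year window, so an
# offset of 70 runs past the last @DataONEorg tweets available in Year 1.
```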

So, file Y1070 should have no data. Rather than delete the file, I’m going to populate it with -9999.

I’m now satisfied that I have all the tweets. Since I did not use a program to do this, I am not satisfied that I did not introduce human error. I want to download all of the files created in my personal Google Drive as .csv files to my work computer. I then want to see if there is a quality control program I can use (such as hashes) to determine whether any of the files are accidental duplicates (the text is converted to numbers and condensed into a single checksum value; non-unique values indicate likely duplicate files).
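A minimal sketch of that idea in Python, assuming the exported .csv files end up in a local folder (the folder name here is hypothetical): each file’s contents are hashed, and any digest that occurs more than once flags a possible duplicate.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Hypothetical local folder holding the exported .csv files
folder = Path("Topsy-Data")

by_digest = defaultdict(list)
for path in sorted(folder.glob("*.csv")):
    # The digest depends only on the file contents, not the filename,
    # so files with identical data end up under the same key.
    digest = hashlib.sha224(path.read_bytes()).hexdigest()
    by_digest[digest].append(path.name)

for digest, names in by_digest.items():
    if len(names) > 1:
        print("possible duplicates:", ", ".join(names))
```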

There are 59 files. Moved them to a Google Drive folder called “DataONE-Topsy”.

Added public visibility to the folder.

https://drive.google.com/folderview?id=0B_9TV1q9zxYuc3UzTGVmS2ZFUUU&usp=sharing

In the download interface, I am informed there are 147 spreadsheets to download.

I changed my Chrome download location to:

C:\Users\tjessel\Documents\DataONE Research\Twitter Data for DataONEorg\Topsy-Data

The resulting zipped file is 462 KB.

The name of the file is “documents-export-2014-02-18.zip”.

Extract all to folder “C:\Users\tjessel\Documents\DataONE Research\Twitter Data for DataONEorg\Topsy-Data\documents-export-2014-02-18”

Almost all of them are 4 KB, so nothing really obvious there.

Now, software for processing?

Searched Google for “hashing software” without quotes and found this:

http://quickhash.sourceforge.net/

A Google search for “hash files for quality control” without quotes brought this to my attention:

http://www.turbosfv.com/Download

TurboSFV looks easy to use to me (and has a 30-day free trial).

Assuming I have a 64-bit machine – TurboSFV x64 (64-bit)

The file went here:

C:\Users\tjessel\Downloads\TurboSFV_PE_x64_5.zip

I apparently do not have a 64-bit machine!

Try the 32-bit version.

TurboSFV x86 (32-bit)

“With TurboSFV, you can create checksums for files, folders, and drives.”

I followed the demo and created a SHA-224 checksum file with the extension .sh2.

The resultant file is 16 KB and the name is “Y1010”.
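For comparison only, a similar checksum listing can be produced in a few lines of Python; this is a sketch and does not reproduce TurboSFV’s actual .sh2 format (the folder and output filenames are hypothetical).

```python
import hashlib
from pathlib import Path

folder = Path("Topsy-Data")              # hypothetical folder of .csv exports
listing = Path("checksums-sha224.txt")   # hypothetical output file

lines = []
for path in sorted(folder.glob("*.csv")):
    digest = hashlib.sha224(path.read_bytes()).hexdigest()
    lines.append(f"{digest}  {path.name}")

listing.write_text("\n".join(lines) + "\n")
print(f"wrote {len(lines)} checksums to {listing}")
```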

Using Alt + PrintScreen I took a screen capture. See the uploaded file “Hashes”.

147 files were processed, and I had 147 spreadsheets.

Now I think I need to validate them to see if any match.

The total size of the files is 525,088 bytes.

The maximum size is 3,835 bytes. The minimum size is 2,958 bytes.

I’m looking for any duplicates. I don’t see an output where I can run “deduplicate” in a spreadsheet, so visual inspection will have to do.
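(A possible alternative to visual inspection, assuming the checksum values can be copied out of the report as plain text: count repeated values with a short script. The sample values below are made up.)

```python
from collections import Counter

# Hypothetical input: one checksum string per line, pasted from the report
checksums = """
aaa111
bbb222
aaa111
ccc333
""".split()

for value, count in Counter(checksums).items():
    if count > 1:
        print(f"{value} appears {count} times")
```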

Several files have identical checksums:

Y4060; Y3060

Y3140; Y3020

Y3110; Y2140

Y3500; Y3310

Y3350; Y1020

Y3550; Y3520; Y3460

Y4320; Y4140

Y3250; Y2380; Y2310

Y2290; Y2230

There are a lot, and I want to spot check them first to see if this is an accurate way of gauging whether files have identical content. Originally I was going to look at the hashes, but I realized that if I named them different things, then the total size would be different even if the data was the same.

Several files have identical checksums – all were “OK.”

However, no files should have identical hashes – correct? One way to test might be to run it again and see if the numbers match. I don’t think that will work.

We went over the idea of checksums in the Environmental Management Institute – I may need to check with someone (pun not intended but enjoyed) on the concept of checksums to make sure I’m looking at this correctly.

Could also run it on two files with identical content to test out the tool. Definitely worth looking at for validation of many files.
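That test is easy to mock up: two files with identical content should produce identical checksums even though their names differ, and changing a single character should change the checksum. A quick sketch with throwaway files (names and contents are arbitrary):

```python
import hashlib
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp, "copy-one.csv")
    b = Path(tmp, "copy-two.csv")   # different name, same content
    a.write_text("date,tweet\n2011-07-31,hello\n")
    b.write_text("date,tweet\n2011-07-31,hello\n")

    ha = hashlib.sha224(a.read_bytes()).hexdigest()
    hb = hashlib.sha224(b.read_bytes()).hexdigest()
    print(ha == hb)   # True: identical content, identical checksum

    b.write_text("date,tweet\n2011-07-31,hello!\n")  # change one character
    hb = hashlib.sha224(b.read_bytes()).hexdigest()
    print(ha == hb)   # False: content differs, checksum differs
```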
