{"id":1962,"date":"2014-02-18T17:55:06","date_gmt":"2014-02-18T17:55:06","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=1962"},"modified":"2014-02-18T17:56:44","modified_gmt":"2014-02-18T17:56:44","slug":"continue-scraping-introduce-quality-control-with-hashes","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/data-science\/continue-scraping-introduce-quality-control-with-hashes\/","title":{"rendered":"Continue Scraping, Introduce Quality Control with Hashes"},"content":{"rendered":"
Continuation and completion of harvesting with quality control / assurance exploration using hashes and checksum software.
Start 97 – 77

97 contains year 3 and offset 450.
Start at 12:05.
Save text file Topsy-97-77. End at 12:21.
New file Topsy-76-56. 56 ends at Y3040.
Expand to include next 3 to 010.
Rename file to 53.

Topsy-76-53

Remaining to process: 52 through 21. 52 should start out Y2 data.

Remember to start from here: Create new file: Topsy-52-21
*Note: this starts out as year 2 data. Stops at Y2140.

Need to check naming to see if I missed anything. Also, how is 1 – 20 stored?

http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=70&mintime=1280664024&maxtime=1312113651

No tweets found. Why? Apparently URL 7 exceeded the range of dates for Year 1, so no tweets were found. URL 6 was the last in Y1 with Twitter data; URL 8 starts year 2. So, file Y1070 should have no data. Rather than delete the file, I'm going to populate it with -9999.

I'm now satisfied that I have all the tweets. Since I did not use a program to do this, I am not satisfied that I did not introduce human error. I want to download all of the files created in my personal Google Drive as .csv files to my work computer. I then want to see if there is a quality control program I can use (such as hashes) to determine if any of the files are accidental duplicates (the text is converted to numbers, which are combined into a value that should be unique to the content; non-unique values indicate duplicate files).

There are 59 files. Moved them to a Google Drive folder called "DataONE-Topsy" and added public visibility to the folder.

https://drive.google.com/folderview?id=0B_9TV1q9zxYuc3UzTGVmS2ZFUUU&usp=sharing

In the download interface, I am informed there are 147 spreadsheets to download.

I changed my Chrome download settings to save to:
C:\Users\tjessel\Documents\DataONE Research\Twitter Data for DataONEorg\Topsy-Data

The resulting zipped file is 462 KB. The name of the file is "documents-export-2014-02-18.zip". Extract all to folder "C:\Users\tjessel\Documents\DataONE Research\Twitter Data for DataONEorg\Topsy-Data\documents-export-2014-02-18". Almost all of the files are 4 KB, so nothing really obvious there.

Now, software for processing?

Searched Google for "hashing software" without quotes and found this:
http://quickhash.sourceforge.net/

A Google search for "hash files for quality control" without quotes brought this to my attention:
http://www.turbosfv.com/Download

TurboSFV looks easy to use to me (and has a 30-day free trial). Assuming I have a 64-bit machine – TurboSFV x64 (64-bit). File went here:
C:\Users\tjessel\Downloads\TurboSFV_PE_x64_5.zip

I apparently do not have a 64-bit machine! Try the 32-bit version: TurboSFV x86 (32-bit).

From the TurboSFV site: "With TurboSFV, you can create checksums for files, folders, and drives."

I followed the demo and created a SHA-224 checksum file with the extension .sh2. The resultant file is 16 KB and the name is "Y1010". Using Alt + PrintScreen I took a screen capture. See uploaded file.

147 files were processed, and I had 147 spreadsheets. Now I think I need to validate them to see if any match.
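Rather than validating by eye alone, a short script could recompute the digests and group matching files for me. The following is only a minimal sketch, assuming Python and its standard hashlib module; the extraction folder path and the choice of SHA-224 come from the notes above, while everything else (variable names, output format) is illustrative.

    # Sketch: hash every exported file with SHA-224 and group files sharing a digest.
    import hashlib
    import os
    from collections import defaultdict

    # Assumed extraction folder from above; adjust as needed.
    folder = (r"C:\Users\tjessel\Documents\DataONE Research\Twitter Data for DataONEorg"
              r"\Topsy-Data\documents-export-2014-02-18")

    digests = defaultdict(list)
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            with open(path, "rb") as fh:
                # The digest depends only on the file's bytes, not on its name.
                digests[hashlib.sha224(fh.read()).hexdigest()].append(name)

    for digest, names in sorted(digests.items()):
        if len(names) > 1:
            print(digest, "->", ", ".join(names))  # candidate duplicates

Run over the extracted folder, this should list the same groups of matching files that TurboSFV flags, which would be a useful cross-check on the tool's output.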
The total size of the files is 525,088 bytes. The maximum size is 3,835 bytes. The minimum size is 2,958 bytes.

I'm looking for any duplicates. I don't see an output where I can run "deduplicate" in a spreadsheet, so visual inspection will have to do.

Several files have identical checksums:

Y4060; Y3060
Y3140; Y3020
Y3110; Y2140
Y3500; Y3310
Y3350; Y1020
Y3550; Y3520; Y3460
Y4320; Y4140
Y3250; Y2380; Y2310
Y2290; Y2230

There are a lot, and I want to spot check them first to see if this is an accurate way of gauging whether files have identical content. Originally I was going to look at the hashes, but I realized that if I named the files different things, then the total size would be different even if the data was the same.

Several files have identical checksums – all were "OK."

However, no files should have identical hashes – correct? One way to test might be to run it again and see if the numbers match. I don't think that will work.

We went over the idea of checksums in the Environmental Management Institute – I may need to check with someone (pun not intended but enjoyed) on the concept of checksums to make sure I'm looking at this correctly.

Could also run on two files with identical content to test out the tool. Definitely worth looking at for validation of many files.
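Before checking with anyone, one way to test the concept myself is sketched below. It again assumes Python's hashlib, and the three file names are made up purely for the test: two files with identical content but different names should produce the same SHA-224 digest, while a file with different content should not, which would confirm that the checksum reflects only a file's contents and is unaffected by renaming.

    # Sketch: identical content should give identical SHA-224 digests, regardless of name.
    import hashlib

    def sha224_of(path):
        with open(path, "rb") as fh:
            return hashlib.sha224(fh.read()).hexdigest()

    # Hypothetical test files, created only for this check.
    with open("copy_a.csv", "w") as fh:
        fh.write("tweet_id,text\n1,hello\n")
    with open("copy_b.csv", "w") as fh:          # same content, different name
        fh.write("tweet_id,text\n1,hello\n")
    with open("different.csv", "w") as fh:       # different content
        fh.write("tweet_id,text\n2,goodbye\n")

    print(sha224_of("copy_a.csv") == sha224_of("copy_b.csv"))     # expected: True
    print(sha224_of("copy_a.csv") == sha224_of("different.csv"))  # expected: False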