Continuation and completion of harvesting with quality control / assurance exploration using hashes and checksum software.
Start 97 – 77
97 contains year 3 and offset 450
Start at 12:05
Save text file
End at 12:21
56 ends at Y3040
Expand to include next 3 to 010
rename file to 53
Remaining to process:
52 through 21
52 should start out with Y2 data.
Remember to start from here:
Create new file: Topsy-52-21
*Note: this starts out with year 2 data. Stops at Y2140.
Need to check naming to see if I missed anything.
Also, how is 1 – 20 stored?
No tweets found. Why?
Apparently URL 7 exceeded the range of dates for Year 1, so no tweets were found.
URL 6 was the last in Y1 with twitter data; URL 8 starts year 2.
So, file Y1070 should have no data. Rather than delete the file, I’m going to populate it with -9999.
I’m now satisfied that I have all the tweets. But since I did this by hand rather than with a program, I am not satisfied that I did not introduce human error. I want to download all of the files created in my personal Google Drive as .csv files to my work computer. Then I want to see if there is a quality control program I can use (such as hashing) to determine whether any of the files are accidental duplicates (a hash function converts a file’s contents into a fixed-size value, or digest, that is effectively unique to that content; matching digests indicate duplicate files).
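The duplicate check I have in mind could also be sketched in Python with the standard library’s `hashlib` (SHA-224 is one option among several). The folder path and `.csv` pattern here are placeholders, not my actual paths:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_digest(path, algo="sha224"):
    """Hash a file's contents in chunks; identical digests mean identical bytes."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(folder):
    """Group .csv files in `folder` by digest; any group of 2+ is a set of duplicates."""
    by_digest = defaultdict(list)
    for path in sorted(Path(folder).glob("*.csv")):
        by_digest[file_digest(path)].append(path.name)
    return {d: names for d, names in by_digest.items() if len(names) > 1}
```

Running `find_duplicates` on the download folder would report only the groups of files whose contents are byte-for-byte identical, regardless of their names.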
There are 59 files. Moved them to a Google Drive folder called “DataONE-Topsy”
Added public visibility to the folder.
In the download interface, I am informed there are 147 spreadsheets to download.
I changed Chrome’s download location setting to:
C:\Users\tjessel\Documents\DataONE Research\Twitter Data for DataONEorg\Topsy-Data
The resulting zipped file is 462 KB.
The name of the file is “documents-export-2014-02-18.zip”
Extract all to folder “C:\Users\tjessel\Documents\DataONE Research\Twitter Data for DataONEorg\Topsy-Data\documents-export-2014-02-18”
Almost all of them are 4 KB, so nothing really obvious there.
Now, software for processing?
Searched Google for “hashing software” without quotes and found this:
Google search for “hash files for quality control” without quotes brought this to my attention:
TurboSFV looks easy to use to me (and has a 30 day free trial).
Assuming I have a 64 bit machine – TurboSFV x64 (64-bit)
File went here:
I apparently do not have a 64 bit machine!
Try the 32 bit version.
TurboSFV x86 (32-bit)
With TurboSFV, you can create checksums for files, folders, and drives.
I followed the demo and created a SHA-224 checksum file with the extension .sh2.
The resultant file is 16 KB and the name is “Y1010”
147 files were processed, and I had 147 spreadsheets.
Now I think I need to validate them to see if any match.
The total size of the files is 525,088 bytes.
The maximum size is 3,835 bytes. The minimum size is 2,958 bytes.
I’m looking for any duplicates. I don’t see an output I can load into a spreadsheet and run a “deduplicate” function on, so visual inspection will have to do.
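The visual inspection could potentially be automated by parsing the checksum listing itself. I don’t know TurboSFV’s exact .sh2 file layout, so this sketch assumes an md5sum-style line format (`digest *filename`), which many checksum tools emit; the format assumption would need checking against a real .sh2 file:

```python
from collections import defaultdict

def duplicate_digests(lines):
    """Group filenames by digest from 'digest *filename' style lines.

    Returns only digests shared by two or more files, i.e. likely duplicates.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):  # skip blanks and comment lines
            continue
        digest, _, name = line.partition(" ")
        groups[digest].append(name.lstrip("*").strip())
    return {d: names for d, names in groups.items() if len(names) > 1}
```

Fed the 147 checksum lines, this would print each shared digest with its group of filenames, replacing the eyeball scan.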
Several files have identical checksums:
Y3550; Y3520; Y3460
Y3250; Y2380; Y2310
There are a lot, and I want to spot-check them first to see if this is an accurate way of gauging whether files have identical content. Originally I worried the hashes might not catch renamed copies: if I named two identical files different things, would the values change? But the hash is computed from a file’s contents only, not its name, so renamed duplicates should still match.
Several files have identical checksums – all were “OK.”
However, no two files with different content should have identical hashes – correct? One way to test might be to run it again and see if the numbers match. I don’t think that will work, though: hashing is deterministic, so a rerun would just reproduce the same values.
We went over the ideas of checksums in the Environmental Management Institute – I may need to check with someone (pun not intended but enjoyed) on the concept of checksums to make sure I’m looking at this correctly.
Could also run it on two files with identical content to test out the tool. Definitely worth looking at for validation of many files.
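That identical-content test can be checked directly in a few lines of Python, independent of any particular tool (the byte strings here are just made-up sample content):

```python
import hashlib

def sha224_of(data: bytes) -> str:
    """SHA-224 hex digest of a byte string (56 hex characters)."""
    return hashlib.sha224(data).hexdigest()

same_a = sha224_of(b"dataone,tweet,sample-row")
same_b = sha224_of(b"dataone,tweet,sample-row")
other = sha224_of(b"dataone,tweet,different-row")

# Identical content always yields the identical digest...
assert same_a == same_b
# ...and different content yields (for all practical purposes) a different one.
assert same_a != other
```

This confirms the core idea: matching digests indicate matching content, so identical checksums across my Topsy files really do flag accidental duplicates.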