Continue Scraping, Introduce Quality Control with Hashes

Continuation and completion of harvesting, plus an exploration of quality control/assurance using hashes and checksum software.


Start 97 – 77

97 contains year 3 and offset 450

Start at 12:05

Save text file
Topsy-97-77

End at 12:21

New File
Topsy-76-56

56 ends at Y3040

Expand to include next 3 to 010

Rename file to 53

Topsy-76-53

Remaining to process:

52 through 21

52 should start with Y2 data.

Remember to start from here:
https://sites.google.com/site/mountainsol/

Create new file: Topsy-52-21

*Note: this starts with Year 2 data. Stops at Y2140.

Need to check naming to see if I missed anything.

Also, how are 1–20 stored?

http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=70&mintime=1280664024&maxtime=1312113651

No tweets found. Why?

Apparently URL 7 exceeded the range of dates for Year 1, so no tweets were found.

URL 6 was the last in Y1 with twitter data; URL 8 starts year 2.

So, file Y1070 should have no data. Rather than delete the file, I’m going to populate it with -9999.
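For context on why URL 7 came up empty: the mintime and maxtime parameters in the Topsy URL look like Unix timestamps bounding the Year 1 window (my assumption, not anything documented by Topsy). A quick sketch to convert them to readable dates:

```python
from datetime import datetime, timezone

# mintime/maxtime copied from the Topsy URL above; assumed to be Unix
# timestamps (seconds since the epoch) bounding the Year 1 query window.
mintime = 1280664024
maxtime = 1312113651

for label, ts in (("mintime", mintime), ("maxtime", maxtime)):
    # Convert to a readable UTC date to see which year each bound falls in.
    print(label, datetime.fromtimestamp(ts, tz=timezone.utc).date())
```

If offset=70 points past the last tweet inside that window, an empty result like the one above is what you would expect.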

I’m now satisfied that I have all the tweets. Since I did not use a program to do this, I am not satisfied that I did not introduce human error. I want to download all of the files created in my personal Google Drive as .csv files to my work computer. I then want to see if there is a quality control program I can use (such as hashes) to determine if any of the files are accidental duplicates (text is converted to numbers; numbers are summed to reveal a unique value; non-unique values indicate duplicate files).
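As a rough sketch of what I have in mind (assuming the exported .csv files end up together in one folder; the folder name here is hypothetical), hashing each file and grouping identical digests would flag accidental duplicates:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Hypothetical folder of exported .csv files; adjust to the actual
# download location on the work computer.
folder = Path("Topsy-Data")

by_digest = defaultdict(list)
for csv_file in sorted(folder.glob("*.csv")):
    # Hash the file contents; identical contents always produce the same
    # digest, so any digest that appears twice means duplicate files.
    digest = hashlib.sha256(csv_file.read_bytes()).hexdigest()
    by_digest[digest].append(csv_file.name)

for digest, names in by_digest.items():
    if len(names) > 1:
        print("Possible duplicates:", ", ".join(names))
```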

There are 59 files. Moved them to a Google Drive folder called “DataONE-Topsy”

Added public visibility to the folder.

https://drive.google.com/folderview?id=0B_9TV1q9zxYuc3UzTGVmS2ZFUUU&usp=sharing

In the download interface, I am informed there are 147 spreadsheets to download.

I changed my Chrome download location setting to:

C:\Users\tjessel\Documents\DataONE Research\Twitter Data for DataONEorg\Topsy-Data

The resulting zipped file is 462 KB.

The name of the file is “documents-export-2014-02-18.zip”

Extract all to folder “C:\Users\tjessel\Documents\DataONE Research\Twitter Data for DataONEorg\Topsy-Data\documents-export-2014-02-18”

Almost all of them are 4 KB, so nothing really obvious there.

Now, software for processing?

Searched Google for “hashing software” without quotes and found this:

http://quickhash.sourceforge.net/

Google search for “hash files for quality control” without quotes brought this to my attention:

http://www.turbosfv.com/Download

TurboSFV looks easy to use to me (and has a 30 day free trial).

Assuming I have a 64-bit machine: TurboSFV x64 (64-bit).

File went here:

C:\Users\tjessel\Downloads\TurboSFV_PE_x64_5.zip

I apparently do not have a 64-bit machine!

Try the 32-bit version.

TurboSFV x86 (32-bit)

With TurboSFV, you can create checksums for files, folders, and drives.

I followed the demo and created a SHA-224 checksum file with the extension .sh2.

The resultant file is 16 KB and its name is “Y1010”.

Using Alt + PrintScreen, I took a screen capture. See the uploaded file, “Hashes.”

147 files were processed, and I had 147 spreadsheets.

Now I think I need to validate them to see if any match.

The total size of the files is 525,088 bytes.

The maximum size is 3,835 bytes. The minimum size is 2,958 bytes.
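Those size figures could also be double-checked with a few lines of Python, assuming the extracted files really are .csv and sit in the export folder noted above:

```python
from pathlib import Path

# Path to the extracted export folder (taken from the extraction step above).
folder = Path(r"C:\Users\tjessel\Documents\DataONE Research\Twitter Data for DataONEorg"
              r"\Topsy-Data\documents-export-2014-02-18")

sizes = [f.stat().st_size for f in folder.glob("*.csv")]
print("files:", len(sizes))
print("total bytes:", sum(sizes))
print("max bytes:", max(sizes))
print("min bytes:", min(sizes))
```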

I’m looking for any duplicates. I don’t see an output where I can run “deduplicate” in a spreadsheet, so visual inspection will have to do.

Several files have identical checksums:

Y4060; Y3060

Y3140; Y3020

Y3110; Y2140

Y3500; Y3310

Y3350; Y1020

Y3550; Y3520; Y3460

Y4320; Y4140

Y3250; Y2380; Y2310

Y2290; Y2230

There are a lot and I want to spot check them first to see if this is an accurate way of gauging if files have identical content. Originally I was going to look at the hashes but I realized that if I named them different things, then the total size would be different even if the data was the same.

Several files have identical checksums – all were “OK.”

However, no files should have identical hashes – correct? One way to test might be to run it again and see if the numbers match. I don’t think that will work.

We went over the ideas of checksums in the Environmental Management Institute – I may need to check with someone (pun not intended but enjoyed) on the concept of checksums to make sure I’m looking at this correctly.

Could also run on two files with identical content to test out the tool. Definitely worth looking at for validation of many files.
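A minimal sketch of that test, using hashlib’s SHA-224 in place of TurboSFV and made-up file names: two files with identical content should produce identical digests, and a file with different content should not.

```python
import hashlib
from pathlib import Path

def sha224_of(path):
    # Same principle as the TurboSFV checksums: the digest depends only on
    # file content, not on the file name.
    return hashlib.sha224(Path(path).read_bytes()).hexdigest()

# Hypothetical test files, not part of the Topsy data.
Path("copy_a.txt").write_text("tweet_id,text\n1,hello\n")
Path("copy_b.txt").write_text("tweet_id,text\n1,hello\n")    # identical content
Path("different.txt").write_text("tweet_id,text\n2,world\n")

print(sha224_of("copy_a.txt") == sha224_of("copy_b.txt"))     # expect True
print(sha224_of("copy_a.txt") == sha224_of("different.txt"))  # expect False
```

If that holds, identical checksums in the TurboSFV report really do indicate duplicate content (the file names don’t matter), rather than a quirk of the tool.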


