Consolidating Year 1 – Year 4 @DataONEorg Tweets

I am continuing quality control efforts today.

From looking at checksums for the files, some of the 147 appear to be the same. This concerns me due to the possibility of human error (my error) in creating the files, since I scraped tweets manually with a browser extension, rather than via a computer programming language like Python.

Even though the files should all be about the same size, since there are 10 tweets per page and 140 characters max per tweet, the presences of a few identical file sizes concerns me. Still, it’s worth spot checking for some, such as “Y4060” and “Y3060” which are not very different.

Opening up the folder where I put them all, I’m going to spot check the few I identified yesterday as having identical file sizes, starting with Y4060 and Y3060.

Y4060 starts with “DataFOUR” while Y3060 starts with “Mark Gollan.” I’m relieved that these two, which seem the likeliest to have human error, are not the same files.

I think the easiest thing to do is to collate all of the spreadsheets into one file, then see if there are any duplicates.

Possible options I have in mind:

a) Combine in Google Drive

b) combine in desktop spreadsheet software (Excel, Open Office)

c) combine CSV files.

I did a google search for “combine spreadsheets in google drive” without quotes. Two results came up.

I found a pretty nice explanation of a “range function” here:

http://www.jellybend.com/2013/01/10/merge-multiple-google-documents/

The ID of the document is the string between “../d/” and “/edit”, in this example the ID is: “1uWocGqA7Bifl61Vu8-TOWpALDHic8gbc5oCuZivZ1Dg”. All you have to do is to put this ID from cell A5 of the Spreadsheet document…

Looks great, but obtaining the unique ID from all 147 documents is still kind of a pain.

I made a copy of the spreadsheet to possibly use later: https://docs.google.com/spreadsheet/ccc?key=0Av9TV1q9zxYudExVeV90bWNPcGxLZE9kTmctX3BNR1E#gid=0

For the third option concerning merging CSV files, I did a google search for “merge multiple CSV files” without quotes.

The first result was interesting, “Merge CSV Files Into One Large CSV File In Windows 7 – Solve.”

I also saw a result at 6 for “How to merge multiple CSV-files into one with Mac OS X terminal.”

I am a bit annoyed that Google Drive won’t let me download my spreadsheets as raw CSV – is it possible there is a function for that? Worth looking… an article mentions “Using the google drive API to download a spreadsheet in CSV format”

Again, I am still going to need the Document IDs, which is annoying, time consuming, and possibly error prone.

docs.google.com/feeds/download/spreadsheets/Export?key<FILE_ID>&exportFormat=csv&gid=0

Since I'm on a Mac at the moment, I think the fastest way is going to be "Merge multiple .csv files."

I’d like to open up all 147 spreadsheets at the same time, then systematically save them as .csv.

I created a new folder in “Documents” and “DataONE” on my local machine.

I downloaded all 147 files as spreadsheets for Microsoft Excel. I’m optimistic my computer has enough memory to open up all 147 files at once, then systematically save them as .csv files. Another possibility: could I perhaps drag all 147 .xls files into TextWrangler? I’ll try that. Definitely not. Had to force quit.

At the risk of freezing up my computer, let me try and open all 147 files in Excel. Start 1:09 pm.

Ended at about 1:14 – took a while to load all those files. On the Mac I can just do Command + Shift + S to “save as” but then I have to select the filetype as .csv from a drop-down. Somewhat annoying but it’s only 147 so it’s not implausible to do by hand.

Honestly thought I just realized a problem with this approach – if I save as .csv, any file that contains a “,” character within the tweet will go into a new folder.

Is it possible to save as a tab delimited file?

I may have to end up collecting all of the 147 URLs anyway, to use the range function.

Under “specialty formats” there is “tab delimited text file” – so that would work in terms of preserving comma content in the original tweets. However, I still have the problem of merging all the files into one.

For this reason, begrudgingly, I think will collect the 147 document IDs to use the import range function.

I do have one question though. Can I change the tab-delimited text file to a .csv, to use the merge function in Terminal? Let me generate two tab-delimited text files to see what happens.

I don’t think they are useful due to the comma content problem, so I am going to delete these .csv files that I just created:

Y438; Y4370

Now, with the three files Y4340, Y4350, Y4360 open in TextWrangler, I will see if I can systematically save them as .csv files. I’m just replacing .txt with .csv when I save the document.

Key point: make sure there are actually commas in one of the documents to see how they behave:

There are no commas in Y4350

There are commas in Y4340

There are commas in Y4360.

So now I have my three files as .csv files. I’m moving them to a new folder to try and combine them all with the Mac OS X terminal method.

I created a new folder within Documents/DataONE/CSV-files

I moved the three files into this new folder.

I opened up Terminal.

I executed the command:

cat *.csv >merged.csv

Took a screen capture: Merged-csv-files

What’s the result? I have a file called “merged.csv” with 31 rows of content.

That makes sense since there is a title row, with 10 rows of tweets per file. If I follow through for all 147 files, I should have approximately 1470 rows of tweets. I can then look and see how these are arranged.

One problem is I am not sure how these will be ordered. Let’s look at the three files to see how they came out.

They are sequentially ordered, so I’m curious how the cat *.csv approached them.

Row 2 in the new merged.csv spreadsheet contains:

Matt Jones @metamattj Ò@djhocking Schildhauer: morpho software and knb (component of @DataONEorg) workflow to share and find data based on” EML metadata. #esa2013 6 months agoReplyRetweetFavorite

Row 11 contains:

kristina simonaityt_ @kristinasimona RT @JacquelynGill: Follow these tweeps in the #ESA2013 session on sharing in science: @ethanwhite @recology_ @metamattj @cjlortie @sandramcÉ 6 months agoReplyRetweetFavoriteTextRow 12 contains:

Jacquelyn Gill @jacquelyngill Follow these tweeps in the #ESA2013 session on sharing in science: @ethanwhite @recology_ @metamattj @cjlortie @sandramchung @dataoneorg 6 months agoReplyRetweetFavorite1 moreRow 21 contains:

Sandra M. Chung @sandramchung RT @DataONEorg: Looking fwd to ignite tomorrow. 8am a bit early for rapid-fire presentation but a great line-up @NEON @NCEAS @cjlortie @reÉ 6 months agoReplyRetweetFavoriteTextRow 22 contains:

Scott Chamberlain @recology_ RT @DataONEorg: Looking fwd to ignite tomorrow. 8am a bit early for rapid-fire presentation but a great line-up @NEON @NCEAS @cjlortie @reÉ 6 months agoReplyRetweetFavoriteRow 31 contains:

Leah A. Wasser @leahawasser Got a data mgmt plan? Great session now by @DataONEorg and @nceas #esa2013 6 months agoReplyRetweetFavorite1 moreI am a bit concerned about what these special characters are – for example “@reÉ” If characters are being altered or corrupted, that could impact sentiment analysis.A spot of good news is that commas will be preserved – notice “data, metadata, and download.” are preserved in one row.

Daniel Hocking @djhocking Budden: can use nemercury @DataONEorg search to find data, metadata, and download. Nice map search too. #ESA2013 #ignite #openscience 6 months agoReplyRetweetFavoriteSo let’s look at the three .csv files and see what they start and end with to see what order they were processed in, and identify the problem with the special characters (verify there is a problem). I’ll open them in TextWrangler.Y4350 starts with matt and ends with kristina (row 2 – 11).

Y4350 starts with Jacquelin and ends with Sandra. (row 12 – 21).

The E’ special character appears to be an ellipsis. Unfortunately, URLs were not preserved with the scraping method (Twitter shortens URLS). For example, a short URL shared on Twitter normally would be “t.co” but even this short URL might be too long.

Essentially this body of work provides access to text and sentiments, but will likely leave out URLs. Another method will need to be devised to extract URLs.

Another change I notice is “kristina simonaityt_” from kristina simonaitytė. I’m not sure if there is a workaround for preserving that, or if Google Drive preserved the special character in the first place.

I’m also going back to sites.google.com to fine the URL with the original tweet from with the @Re… – That will be Y4350. That corresponds to URL 143 at <https://sites.google.com/site/mountainsol/>

“SHOULD” correspond. It does not. I’m a bit confused. Row 22 contains the special character. Row 22 will be in – ah. Year 4, offset 360. So, that should be URL 144 from <https://sites.google.com/site/mountainsol/>.

Ok, I changed the offset key to 360:

http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=360&mintime=1375358424&maxtime=1391515257

Ah still not there – very confused. But using search I located the tweet from row 21 in question:

RT @DataONEorg: Looking fwd to ignite tomorrow. 8am a bit early for rapid-fire presentation but a great line-up@NEON @NCEAS @cjlortie @re…

It looks like “Re…” was something else in the original tweet (@recology_) from @DataONEorg, and either Topsy or Twitter filtered it once it exceeded the 140 characters allowed.

Here’s the original tweet:

Looking fwd to ignite tomorrow. 8am a bit early for rapid-fire presentation but a great line-up @NEON @NCEAS @cjlortie @recology_ #ESA2013

— DataONE (@DataONEorg) August 5, 2013

https://twitter.com/DataONEorg/status/364497908143755264

I am concerned that I can’t map the tweet back to the URL, but I don’t see that it’s a problem at the moment for what I’m doing, as long as the tweets are in order.

If processed as I expect, they will be in reverse chronological order. Y4 offset 350 are the newest tweets of year 4; Y4 offset 010 are the oldest tweets of year 4.

At the moment, I have established that special characters are not handled well by the process I have outlined. Let me see how the ellipses are stored in the Excel files I obtained from Google Docs, starting with Y4. In the text / csv file, ellipses are represented as the character “É”. I’m looking at row 4 and Row 3 in Y4340 right now. Row 3 has “Ê” for some reason. Let’s look at the .xls file, row 3 and 4:

Row 3:

Matt Jones @metamattj @JacquelynGill #ESA2013 Agreed re: commenting, but #knb and other @DataONEorg repos are open and non-proprietary. 6 months agoReplyRetweetFavorite

Row 4:

Carly Strasser @carlystrasser AGU abstract deadline extended to TOMORROW 6p. Submit to our session on managing data! fallmeeting.agu.org/2013/scientifi… cc @_inundata @DataONEorg 6 months agoReplyRetweetFavorite1 more

Here, it’s obvious that the ellipses are still preserved. It’s likely I should have converted the columns to “text only” – I confirmed that these are “General” formatted columns.

Let me see how long it takes to convert the columns from “general” to “text only” and see if that will make a difference for me.

I converted Y4340 to text. There is an ellipsis at row 4. Row 11 has a special character, ė.

Note: Keyboard shortcut for “format cells” is command + 1.

Y4330 has ellipses at row 8.

Missed one before 4210 will have to find out which one it is.

For some reason I have a spreadsheet called “Y4170(2)” so that means there is a duplicate. Why?

There is definitely a Y4140.xlsx but I don’t know if it is the same, and it’s not worth opening the other one so I’ll just process it as normal and check it later.

I’m going to stop at Y4060, open up the tab-delimited text files I have saved so far and see if it is worth continuing (check for odd characters). Basically I’m wondering if the range function with Google Docs is not a better option for me. I have saved 31 so far and really don’t like doing it manually.

Special characters are great and preserved (ellipses are preserved, non ASCII characters are preserved such as “Pau Aragó”, it’s just really annoying to do command 1, save as text, and then convert to tab-delimited file.

I expect it’s faster to copy and paste out the unique IDs. I might open them in chrome with the Google Chrome extension. Worried about stopping mid-way with all these windows open. But, it appears I’ve processed Y4360 through Y4080, so I can just start again and exclude those if my computer crashes.

I’m on my home computer. Don’t have the extension. Reference previous lab notebook entry: https://notebooks.dataone.org/data-science/scraping-dataoneorg-tweets-off-the-web-with-browser-extensions/

I mention two: link miner and Linkclumper. Looks like I installed “Linkclump” on my CICS workstation. I’ll install that now on my home computer.

https://chrome.google.com/webstore/detail/linkclump/

The other option to explore is this: Is there something that will “fetch” all of the unique Google Spreadsheet IDs from one folder? In this case, fetch all the document IDs from the DataONE-Topsy folder. The key here is to be systematic – which is why a program would be preferable.

I did a google search for “fetch all IDs from Google Drive” without quotes.

First two results interested me

https://developers.google.com/drive/v2/reference/files/get

I don’t think this helps me, but I do have a “parameter name” that I am interested in: fileId

I searched the developers site for this string: “get fileId for all files in folder” without quotes. That’s pretty straightforward to me, but the results are a bit beyond my skill level. However, I think this is pretty close to what I’m looking for: http://stackoverflow.com/questions/21681441/google-drive-file-id-and-folder-id-conventions. The call is “files.list.”

I think I’m going to download the desktop version of Google Drive to see if there is another way of working with that. Get it here: https://tools.google.com/dlpage/drive/index.html

installgoogledrive.dmg in downloads folder. 25 megabytes. Plus 3 GB of content from Google Drive. Honestly do not want that junk on my computer, that is why it is on Google Drive.

So, I think the solution is to open these files in tabs, and very carefully and systematically copy and paste the unique document id.

Probably will be faster, and LinkClump works, but it’s difficult to open all the files. because you can’t scroll. Essentially it’s limited to what you can see in your screen. I might wait until I can view them all on a larger screen (CICS workstation).

One Reply to “Consolidating Year 1 – Year 4 @DataONEorg Tweets”

Kyle on March 1, 2014 at 9:51 pm said:

I cannot believe how much of that you did by hand, the agony! I’ve found myself doing similar data processing for fun mostly, but a little for work; ordering, sorting, combining. So I followed the link to your data files and got a merged csv of all 147 files in no more than 10 minutes.

FWIW:
1 – I downloaded the whole folder from google drive as zipped container of xlsx files.
2 – Unzipped them and used:
`md5sum DataONE-Topsy/*.xlsx | sort | uniq -D -w 15`
to give me all the files that are EXACTLY the same, found only one pair Y4160, Y4170
3 – Used https://github.com/dilshod/xlsx2csv to convert them all to csv
`for i in *.xlsx; do xlsx2csv.py $i $i.csv; done`
Thankfully uses UTF-8 encoding so it retains the special characters and doesn’t convert the ellipsis
4 – Finally `cat *.csv > ../merged.csv`

Hope this gives you some ideas and helps save time in the future.

Reply ↓

About Tanner Jessel

One Reply to “Consolidating Year 1 – Year 4 @DataONEorg Tweets”

Leave a Reply Cancel reply