{"id":1979,"date":"2014-02-20T03:23:19","date_gmt":"2014-02-20T03:23:19","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=1979"},"modified":"2014-02-20T03:23:19","modified_gmt":"2014-02-20T03:23:19","slug":"consolidating-year-1-year-4-dataoneorg-tweets","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/data-science\/consolidating-year-1-year-4-dataoneorg-tweets\/","title":{"rendered":"Consolidating Year 1 – Year 4 @DataONEorg Tweets"},"content":{"rendered":"
I am continuing quality control efforts today.<\/p>\n
From looking at checksums for the files<\/a>, some of the 147 appear to be the same. This concerns me due to the possibility of human error (my error) in creating the files, since I scraped tweets manually with a browser extension, rather than via a computer programming language like Python.<\/p>\n Even though the files should all be about the same size, since there are 10 tweets per page and 140 characters max per tweet, the presences of a few identical file sizes concerns me. \u00a0Still, it’s worth spot checking for some, such as “Y4060” and “Y3060” which are not very different.<\/p>\n Opening up the folder where I put them all<\/a>, I’m going to spot check the few I identified yesterday as having identical file sizes, starting with Y4060 and Y3060.<\/p>\n Y4060 starts with “DataFOUR” while Y3060 starts with “Mark Gollan.” \u00a0I’m relieved that these two, which seem the likeliest to have human error, are not the same files.<\/p>\n I think the easiest thing to do is to collate all of the spreadsheets into one file, then see if there are any duplicates.<\/p>\n Possible options I have in mind:<\/p>\n a) Combine in Google Drive<\/p>\n b) combine in desktop spreadsheet software (Excel, Open Office)<\/p>\n c) combine CSV files.<\/p>\n I did a google search for “combine spreadsheets in google drive” without quotes. Two results came up.<\/p>\n I found a pretty nice explanation of a “range function” here:<\/p>\n http:\/\/www.jellybend.com\/2013\/01\/10\/merge-multiple-google-documents\/<\/a><\/p>\n The ID of the document is the string between \u201c..\/d\/\u201d and \u201c\/edit\u201d, in this example the ID is: \u201c1uWocGqA7Bifl61Vu8-TOWpALDHic8gbc5oCuZivZ1Dg\u201d. All you have to do is to put this ID from cell A5 of the Spreadsheet document…<\/p><\/blockquote>\n Looks great, but obtaining the unique ID from all 147 documents is still kind of a pain.<\/p>\n I made a copy of the spreadsheet to possibly use later:\u00a0https:\/\/docs.google.com\/spreadsheet\/ccc?key=0Av9TV1q9zxYudExVeV90bWNPcGxLZE9kTmctX3BNR1E#gid=0<\/p>\n For the third option concerning merging CSV files, I did a google search for “merge multiple CSV files<\/a>” without quotes.\u00a0 The first result was interesting, “Merge CSV Files Into One Large CSV File In Windows 7 – Solve<\/a>.”<\/p>\n I also saw a result at 6 for “How to merge multiple CSV-files into one with Mac OS X terminal<\/a>.”<\/p>\n I am a bit annoyed that Google Drive won’t let me download my spreadsheets as raw CSV – is it possible there is a function for that? Worth looking… \u00a0an article mentions “Using the google drive API to download a spreadsheet in CSV format<\/a>”<\/p>\n Again, I am still going to need the Document IDs, which is annoying, time consuming, and possibly error prone.<\/p>\n I’d like to open up all 147 spreadsheets at the same time, then systematically save them as .csv.<\/p>\n I created a new folder in “Documents” and “DataONE” on my local machine.<\/p>\n I downloaded all 147 files as spreadsheets for Microsoft Excel. I’m optimistic my computer has enough memory to open up all 147 files at once, then systematically save them as .csv files. Another possibility: could I perhaps drag all 147 .xls files into TextWrangler<\/a>? I’ll try that. Definitely not. Had to force quit.<\/p>\n At the risk of freezing up my computer, let me try and open all 147 files in Excel. Start 1:09 pm.<\/p>\n Ended at about 1:14 – took a while to load all those files. \u00a0On the Mac I can just do Command + Shift + S to “save as” but then I have to select the filetype as .csv from a drop-down. \u00a0Somewhat annoying but it’s only 147 so it’s not implausible to do by hand.<\/p>\n Honestly thought I just realized a problem with this approach – if I save as .csv, any file that contains a “,” character within the tweet will go into a new folder.<\/p>\n Is it possible to save as a tab delimited file?<\/p>\n I may have to end up collecting all of the 147 URLs anyway, to use the range function.<\/p>\n Under “specialty formats” there is “tab delimited text file” – so that would work in terms of preserving comma content in the original tweets. \u00a0However, I still have the problem of merging all the files into one.<\/p>\n For this reason, begrudgingly, I think will collect the 147 document IDs to use the import range function.<\/p>\n I do have one question though. Can I change the tab-delimited text file to a .csv, to use the merge function in Terminal? Let me generate two tab-delimited text files to see what happens.<\/p>\n I don’t think they are useful due to the comma content problem, so I am going to delete these .csv files that I just created:<\/p>\n Y438; Y4370<\/p>\n Now, with the three files Y4340, Y4350, Y4360 open in TextWrangler, I will see if I can systematically save them as .csv files. I’m just replacing .txt with .csv when I save the document.<\/p>\n Key point: make sure there are actually commas in one of the documents to see how they behave:<\/p>\n There are no commas in Y4350<\/p>\n There are commas in Y4340<\/p>\n There are commas in Y4360.<\/p>\n So now I have my three files as .csv files. I’m moving them to a new folder to try and combine them all with the Mac OS X terminal method.<\/p>\n I created a new folder within Documents\/DataONE\/CSV-files<\/p>\n I moved the three files into this new folder.<\/p>\n I opened up Terminal.<\/p>\n I executed the command:<\/p>\n cat *.csv >merged.csv<\/p>\n Took a screen capture: Merged-csv-files<\/p>\n
\n<\/a><\/p>\ndocs.google.com\/feeds\/download\/spreadsheets\/Export?key<FILE_ID>&exportFormat=csv&gid=0\r\n<\/code><\/pre>\n
Since I'm on a Mac at the moment, I think the fastest way is going to be \"Merge multiple .csv files.\"<\/code><\/p>\n