{"id":1963,"date":"2014-02-07T23:15:03","date_gmt":"2014-02-07T23:15:03","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=1963"},"modified":"2014-02-07T23:15:03","modified_gmt":"2014-02-07T23:15:03","slug":"scraping-dataoneorg-tweets-off-the-web-with-browser-extensions","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/data-science\/scraping-dataoneorg-tweets-off-the-web-with-browser-extensions\/","title":{"rendered":"Scraping @DataONEorg Tweets Off the Web with Browser Extensions"},"content":{"rendered":"

An earlier method I tried was unable to harvest tweets mentioning @DataONEorg<\/a>: the Google Chrome browser extension, “Scraper<\/a>.”<\/p>\n

Scraper is a simple data mining extension for Google Chrome\u2122 that is useful for online research when you need to quickly analyze data in spreadsheet form.<\/p><\/blockquote>\n

Reviewing some of the software tools available from the DataONE web site <http:\/\/www.dataone.org\/software_tools_catalog<\/a>>, I noticed an entry for “iMacros,” which can automate scraping <http:\/\/imacros.net\/browser\/fx\/welcome<\/a>>.<\/p>\n

I also realized that I only need 149 pages of Tweets from Topsy<\/a>, far fewer than the 1490 I mistakenly estimated earlier.\u00a0 This is manageable to do by hand, although I would prefer something more “professional,” like a script.<\/p>\n

For the sake of Open Notebook Science, I am attaching a “Master Tweet” list to this post (20100801-DataONEorg-Tweet-Master.csv<\/a>).<\/p>\n

I want to take a moment to explain how I generated this file.<\/p>\n

First, I established the maximum range of values for Topsy.<\/p>\n

Then, I took the maximum range and worked backwards through each date range, divided according to the years of the DataONE program.\u00a0 This is what you will see in column 1 of the table data – Y1, Y2, Y3, and Y4.\u00a0 The data within the tables are valid for August 1, 2010 through February 4, 2014.<\/p>\n

Within the same date range, all of the URLs follow a similar structure, as shown below:<\/p>\n

http:\/\/topsy.com\/s?q=%40DataONEorg&type=tweet&sort=date&offset=10&mintime=1280664024&maxtime=1312113656<\/p>\n

The maximum for this particular date range (August 1, 2010 – August 1, 2011) is “70.”\u00a0 That’s a number I’ve called the “offset” key because the word “offset” is used to describe the pagination within the URL – in the example above, the “offset” is set to 10.<\/p>\n

The pagination is in increments of 10.\u00a0 This means that an offset of “11” will not produce results.<\/p>\n

Therefore, URLs for harvesting are generated by increasing the offset by 10 for each iteration.<\/p>\n
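The offset arithmetic above can be sketched in a few lines of Python (the timestamps and the maximum offset of 70 are copied from the example range; the script itself is my own illustration, not part of the original spreadsheet method):

```python
# Build the Topsy search URLs for one date range by stepping the
# "offset" pagination parameter in increments of 10.
BASE = ("http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date"
        "&offset={offset}&mintime=1280664024&maxtime=1312113656")

MAX_OFFSET = 70  # highest offset observed for the Aug 2010 - Aug 2011 range

urls = [BASE.format(offset=o) for o in range(10, MAX_OFFSET + 10, 10)]

for url in urls:
    print(url)
```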

Since I lack scripting skills, a simple spreadsheet algorithm can accomplish this instead, even for a very large range (as was needed for other quarters, where the offset key ranged from 10 to 560).<\/p>\n

The methods are as follows:<\/p>\n

Take the URL:<\/p>\n

http:\/\/topsy.com\/s?q=%40DataONEorg&type=tweet&sort=date&offset=10&mintime=1280664024&maxtime=1312113656<\/p>\n

Place it into a spreadsheet in column “A.”<\/p>\n

Excise (Ctrl X) the portion that you are interested in modifying.\u00a0 In this case, the portion I am interested in modifying is what follows the “offset=” portion of the URL.<\/p>\n

Following the example, you would end up with “http:\/\/topsy.com\/s?q=%40DataONEorg&type=tweet&sort=date&offset=” in column A of a spreadsheet.<\/p>\n

Paste the remainder into column B.<\/p>\n

Excise again the section of the URL that you are NOT going to modify.<\/p>\n

In this case, the remainder is “&mintime=1280664024&maxtime=1312113656.”<\/p>\n

You are left with three pieces of a URL: two that you will not modify and one that you will.<\/p>\n

Propagate the changes in the target column.<\/p>\n

In this case, the changes are “10, 20, 30, 40, 50, 60” and so on, from row 1 onward.<\/p>\n

Upon conclusion, propagate the sections of the URL that you are not interested in changing down the same number of rows as the section that you are changing.<\/p>\n

Copy all three columns to a text editor capable of executing the command “Find” and “Replace.” An appropriate text editor for Windows is Notepad++<\/a>.\u00a0 An appropriate editor for the Mac is TextWrangler<\/a>.<\/p>\n

Exploiting the “Find” and “Replace” function of the text editor, find all tab characters and replace them with nothing.\u00a0 Essentially, this means copying the white space between columns (typically rendered as about 5 spaces), directing the text editor to find all instances of it, and leaving the “Replace” field empty so that the white space is deleted.<\/p>\n

This will result in a plain-text, continuous URL with the modifications that you made in column 1.\u00a0 You may then paste the information back into your spreadsheet and save it as a .csv file in accordance with DataONE Best Practices<\/a> for data management.<\/p>\n
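For anyone who does script, the whole cut-propagate-paste routine above boils down to concatenating a fixed prefix, a changing offset, and a fixed suffix, then writing one URL per row to a .csv file (the filename and variable names here are my own; the prefix and suffix are copied from the example URL):

```python
import csv

# Column A: fixed prefix; column B: changing offset; column C: fixed suffix.
PREFIX = "http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset="
SUFFIX = "&mintime=1280664024&maxtime=1312113656"

# Deleting the tabs between pasted columns is equivalent to plain
# string concatenation of the three pieces.
urls = [PREFIX + str(offset) + SUFFIX for offset in range(10, 80, 10)]

# Save one URL per row, mirroring the "Master Tweet" list layout.
with open("topsy-urls.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for url in urls:
        writer.writerow([url])
```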

20100801-DataONEorg-Tweet-Master.csv<\/a><\/p>\n

For the purposes of harvesting the tweets, I can either open them manually in tabs and use Google Chrome’s “Scraper” tool, or I can open them automatically in Firefox using a Firefox extension called “Linky<\/a>.”\u00a0 Linky is something of a quality assurance step.\u00a0 It prevents me from “forgetting” to open a link and allows me to be systematic about opening links.\u00a0 It also opens links from a text list, with the caveat that it only opens 100 links in tabs at a time. Therefore, I can copy 100 rows of my CSV data over to a plain text file, then use Linky to open the URLs in my browser.\u00a0 As I copy out the tweets, I can then systematically close the windows with Ctrl W. This is about as automated as I can get without scripting.<\/p>\n
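Since Linky caps out at 100 tabs, the batching step described above amounts to splitting the URL list into chunks of 100. A small sketch, with placeholder strings standing in for the real CSV rows:

```python
def chunks(items, size=100):
    # Yield successive batches no larger than Linky's 100-tab limit.
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Placeholder URLs; in practice these come from the master CSV.
urls = ["url-{0}".format(n) for n in range(1, 251)]

batches = list(chunks(urls))
# 250 URLs split into batches of 100, 100, and 50.
```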

I’ve created two such files now while I have the CSV document open.<\/p>\n

Now I need to evaluate the iMacros tool.\u00a0 I’m inclined to use Firefox because I already know how to use Linky.\u00a0 However, the new Scraper extension I tried for Google Chrome worked very well and nicely transferred scraped content into a spreadsheet for me.\u00a0 It might cut out some extra steps.\u00a0 I am not aware of a “link opening” tool such as Linky for Google Chrome.\u00a0 However, I do know that iMacros has a “Chrome” extension<\/a>. So, if iMacros can automate opening up URLs, especially from a text list, perhaps I can use both.<\/p>\n

It is also possible that iMacros will handle both opening links and scraping. We’ll see.\u00a0 I’ve looked at it and feel like it might be too complicated to mess with when I already have a pretty clear idea of what I want to do. Trying to learn how to use the macro tool effectively might take as much time as executing a method I’m already familiar with – although there are some handy YouTube videos <http:\/\/www.youtube.com\/results?search_query=imacros<\/a>>.<\/p>\n

My problem so far is that I don’t have a “scraper” app for Firefox and I don’t have a “link management” app for Chrome.<\/p>\n

I did a quick search in Chrome Web store for a browser add-on:<\/p>\n

https:\/\/chrome.google.com\/webstore\/detail\/linkminer-open-all-links\/<\/p>\n

I’m not sure what search term I used to access that one.<\/p>\n

There is another extension called “LinkClump” that appears to be more highly rated.<\/p>\n

https:\/\/chrome.google.com\/webstore\/detail\/linkclump\/<\/p>\n

I’ll try the first one, LinkMiner, since I already installed it.<\/p>\n

Open up the plain text document in Chrome:<\/p>\n

file:\/\/\/C:\/Users\/tjessel\/Documents\/DataONE%20Research\/Twitter%20Data%20for%20DataONEorg\/1-100-Topsy-Links-DataONEorg.txt<\/p>\n

Sad to report, it won’t open my links.\u00a0 However, that’s not really a problem.\u00a0 I can make them into links using something like Dreamweaver. But I don’t have that, so I’m just going to change my list of links into a list of generic HTML links by modifying the .csv file and running the find-and-replace algorithm to excise the whitespace.<\/p>\n

For example:<\/p>\n

http:\/\/topsy.com\/s?q=%40DataONEorg&type=tweet&sort=date&offset=140&mintime=1312200024&maxtime=1343736082\r\n\r\nbecomes\r\n\r\n<a href=\"http:\/\/topsy.com\/s?q=%40DataONEorg&type=tweet&sort=date&offset=140&mintime=1312200024&maxtime=1343736082\">1<\/a><\/pre>\n
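The same conversion can be sketched in Python: wrap each URL in an anchor tag numbered by its position and write the result out as an HTML file that Chrome can render (the output filename is hypothetical):

```python
# Placeholder list; in practice this holds all URLs from the CSV.
urls = [
    "http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date"
    "&offset=140&mintime=1312200024&maxtime=1343736082",
]

# Wrap each URL in a numbered anchor tag, one per line.
links = ['<a href="{0}">{1}</a>'.format(url, i + 1)
         for i, url in enumerate(urls)]

html = "<html><body>\n" + "<br>\n".join(links) + "\n</body></html>"

with open("topsy-links.html", "w") as f:
    f.write(html)
```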

I’m uploading the csv data here to demonstrate the method.<\/p>\n

20100801-DataONEorg-Tweet-Master-URLs<\/a><\/pre>\n

So now I just need to copy and paste the CSV data into Notepad++, run the find-and-replace algorithm to remove white space, and save the whole thing as a .html file to open up in Chrome as hyperlinks. I’m also uploading that here (actually I’m uploading it as a .txt file because WordPress apparently disallows .html uploads – simply change the .txt extension to .html).<\/p>\n

topsy-links<\/a><\/p>\n

I don’t know how to open a file in Chrome, so I just drag and drop it from Windows File Explorer into the address bar.<\/p>\n

file:\/\/\/C:\/Users\/tjessel\/Documents\/DataONE%20Research\/Twitter%20Data%20for%20DataONEorg\/topsy-links.html<\/p>\n

After right clicking and attempting to open, I learned I should have trusted the reviews.\u00a0 Apparently the link manager I installed does not work with newer versions of Chrome. So, let’s try the other one.<\/p>\n

Ok – the other one, LinkClump, works, but not with any of the methods I am used to.<\/p>\n

So, I added the URLs to a personal Google Site:<\/p>\n

Provenance Repository and Publishing Toolkit Project<\/a><\/p><\/blockquote>\n