Scraper is a simple data mining extension for Google Chrome™ that is useful for online research when you need to quickly analyze data in spreadsheet form.
Reviewing some of the software tools available from the DataONE web site <http://www.dataone.org/software_tools_catalog> I noticed an entry for “iMacros” that can automate scraping <http://imacros.net/browser/fx/welcome>.
I also realized that I only need 149 pages of Tweets from Topsy, far less than the 1490 I mistakenly estimated earlier. This is manageable to do by hand, although I would prefer to have something more “professional” like a script.
For the sake of Open Notebook Science, I am attaching a “Master Tweet” list to this post (20100801-DataONEorg-Tweet-Master.csv).
I want to take a moment to explain how I generated this file.
First, I established the maximum range of values for Topsy.
Then, I took the maximum range and worked backwards through each date range, which is divided by years of the DataONE program. This is what you will see in the table data for column 1 – Y1, Y2, Y3, and Y4. The data within the tables are valid for August 1, 2010 through February 4, 2040.
Within the same date range, all of the URLS follow a similar structure, as evident below:
The maximum for this particular date range (August 1, 2010 – August 1, 2011) is “70.” That’s a number I’ve called the “offset” key because the word “offset” is used to describe the pagination within the URL – in the example above, the “offset” is set to 10.
The pagination is in increments of 10. This means that “11” will not produce results.
Therefore, URLs for harvesting are generated by increasing the offset by 10 for each iteration.
Outside of scripting skills that I am not aware of, a simple algorithm can accomplish this, even for very large range (as was needed for other quarters where the offset key ranged from 10 to 560).
The methods are as follows:
Take the URL:
Place it into a spreadsheet in column “A.”
Excise (Ctrl X) the portion that you are interested in modifying. In this case, the portion I am interested in modifying is what follows the “offset=” portion of the URL.
Following the example, you would end up with “http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=” in column A of a spreadsheet.
Paste the remainder into column B.
Excise again the section of the URL that you are NOT going to modify.
In this case, the remainder is “&mintime=1280664024&maxtime=1312113656.”
You are left with three pieces of a URL: two you will not modify and 1 section you will.
Propogate the changes in the target column.
IN this case the changes are “10, 20, 30, 40, 50, 60” and so on from row 1 onward.
Upon conclusion, propogate the sections of the URL that you are not interested in changing, to match the section that you are changing.
Exploiting the “Find” and “Replace” function of the text editor, find all “Tabs” and replace with “deletions.” Essentially, this means copying the white space (typically 5 spaces) and directing the text editor to find all instances of that, then making sure to “delete” the 5 white spaces in the “Replace” function.
This will result in a plain-text, continuous URL with the modifications that you made in column 1. You may then past the information back into your spreadsheet and save as a .csv file in accordance with DataONE Best Practices for data management.
For the purposes of harvesting the tweets, I can either open them manually in tabs and use Google Chrome’s “Scraper” tool, or I can open them automatically in Firefox using a Firefox extension called “Linky.” Linky is something of a quality assurance step. It prevents me from “forgetting” to open a link and allows me to be systematic about opening links. It also opens link from a text list of links, with the caveat that it only opens 100 links in tabs at a time. Therefore, I can copy 100 rows of my CSV data over to a plain text file, then use Linky to open the text URLs in my browser. As I copy out the tweets, I can then systematically close out the windows with Ctrl W. This is about as automated as I can get without scripting.
I’ve created two such files now while I have the CSV document open.
Now I need to evaluate the iMacros tool. I’m inclined to use Firefox because I already know how to use Linky. However, the new Scraper extension I tried for Google Chrome worked very well, and nicely transferred scraped content into a spreadsheet for me. It might cut out some extra steps. I am not aware that a “link opening” tool such as Linky exists for Google Chrome. However, I do know that iMacros has a “Chrome” extension. So, if “iMacros” can automate opening up URLs, especially form a text base, perhaps I can use both.
It is also possible that iMacros will handle both opening links and scraping. We’ll see. I’ve looked at it and feel like it might be too complicated for me to mess with when I have a pretty clear idea of what I want to do already. Trying to learn how to effectively use the macro tool might take as much time as executing a method I’m already familiar with – although there are some handy YouTube videos <http://www.youtube.com/results?search_query=imacros>
My problem so far is that I don’t have a “scraper” app for Firefox and I don’t have a “link management” app for Chrome.
I did a quick search in Chrome Web store for a browser add-on:
I’m not sure what search term I used to access that one.
There is another extension called “LinkClump” that appears to be more highly rated.
I’ll try the first one, LinkMiner, since I already installed it.
Open up the plain text document in Chrome:
Sad to report it won’t open my links. However, that’s not really a problem. can make them into links using something like Dreamweaver. But I don’t have that so I’m just going to change my list of links into a list of generic HTML links by modifying the .csv file and running the find-and-replace algorithm to excise the whitespace.
http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=140&mintime=1312200024&maxtime=1343736082 becomes <a href ="http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=140&mintime=1312200024&maxtime=1343736082 ">1</a>
I’m uploading the csv data here to demonstrate the method.
So now I just need to copy and paste the CSV data into notepad++, run the find and replace algorithm to remove white space, and save the whole thing as a .htm file to open up in Chrome as hyperlinks. I’m also uploading that here (actually I’m uploading it as a .txt file because WordPress apparently disallows documents uploaded as .html – simply change the .txt to a .html extension).
I don’t know how to open a file in Chrome so just drag and drop from Windows file explorer into the address bar of Chrome.
After right clicking and attempting to open, I learned I should have trusted the reviews. Apparently the link manager I installed does not work with newer versions of Chrome. So, let’s try the other one.
Ok – the other one works, but not with any of the methods I am used to.
So, I added the URLs to a personal Google Site,
Where it works fine with the add-on.
I’ll just do 1 – 20 and see how long it takes me to scrape those.
Works fine with Scraper extension. Gotta save each time though, or it will override each time.
Actually I started naming them Y110, Y120, Y130 to group them by years. ten at a time. First twenty are done. Took maybe 30 minutes? Should have logged a start time. Next twenty I’ll do that.
For reference, this is where I stopped for the day – 130
I changed the settings on one spreadsheet so Y110 data (first ten tweets of Year One – but I need to double check a few things) is visible on the open web.
However I am interested in aggregating all of these into one spreadsheet so I’m not going to bother with the remainder.