Harvesting @DataONEorg Twitter Mentions via Topsy

The previous notebook entry concerned mentions of @DataONEorg on Twitter.

I established the following:

The oldest tweet is from 2 years ago.

It is dated July 29, 2012.

This tweet is accessible from here:

http://topsy.com/s?q=%40DataONEorg&window=a&type=tweet&sort=date&offset=990

The very first re-tweet of @DataONEorg was March 15, 2011.

This was about four months after @DataONEorg joined Twitter (November 18, 2010).

The tweet is accessible via Topsy from this link:

http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=150&mintime=1288612824&maxtime=1320148851

This is valid for the time period November 1, 2010 to November 1, 2011.

I need the missing period between November 1, 2011 and July 29, 2012.

I must generate a new search limited to that time period on Topsy.

The link for the time period November 1, 2011 to July 29, 2012 is:

http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&mintime=1320148824&maxtime=1343563251
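The mintime and maxtime values in these links look like Unix timestamps. The following is a minimal sketch of how a date can be turned into such a value and how a search link can be assembled from it. It assumes the timestamps are read as UTC (decoded that way, the values in the links above fall at about noon); how Topsy itself interprets them is not confirmed here. The helper names to_epoch, from_epoch and topsy_url are made up for illustration.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

BASE = "http://topsy.com/s"  # search endpoint taken from the links in this entry


def to_epoch(year, month, day, hour=0, minute=0, second=0):
    """Convert a calendar date and time (assumed UTC) to a Unix timestamp."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp())


def from_epoch(timestamp):
    """Decode a Unix timestamp into a UTC datetime, to check a link by eye."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


def topsy_url(mintime, maxtime, offset=None):
    """Build a date-sorted @DataONEorg tweet search limited to a time window."""
    params = {"q": "@DataONEorg", "type": "tweet", "sort": "date"}
    if offset is not None:
        params["offset"] = offset
    params["mintime"] = mintime
    params["maxtime"] = maxtime
    # urlencode percent-encodes "@" as %40, matching the links above
    return BASE + "?" + urlencode(params)


# Decode the window used for the missing period and rebuild its link
print(from_epoch(1320148824), from_epoch(1343563251))
print(topsy_url(1320148824, 1343563251))
```

Decoding the timestamps this way is a quick check that a link really covers the period it is supposed to cover.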

I now have four links for four time periods:

  1. November 1, 2010 to November 1, 2011: http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=170&mintime=1288612824&maxtime=1320148851
  2. November 1, 2011 to July 29, 2012: http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=340&mintime=1320148824&maxtime=1343563251
  3. July 29, 2012 to July 29, 2013: http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=570&mintime=1343563224&maxtime=1375099251
  4. July 29, 2013 to February 6, 2014: http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=410&mintime=1375099224&maxtime=1391688051

It is now possible to estimate the number of tweets, based on 10 tweets per page:

  1. November 1, 2010 to November 1, 2011: n = 170, 1,700 tweets
  2. November 1, 2011 to July 29, 2012: n = 340, 3,400 tweets
  3. July 29, 2012 to July 29, 2013: n = 570, 5,700 tweets
  4. July 29, 2013 to February 6, 2014: n = 410, 4,100 tweets
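The estimate can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the list: it takes each n as given and multiplies by 10 tweets per page, which is the assumption stated in this entry rather than anything confirmed about how Topsy counts offsets.

```python
TWEETS_PER_PAGE = 10  # assumption used in this entry

# (time period, n) pairs copied from the list above
PERIODS = [
    ("November 1, 2010 to November 1, 2011", 170),
    ("November 1, 2011 to July 29, 2012", 340),
    ("July 29, 2012 to July 29, 2013", 570),
    ("July 29, 2013 to February 6, 2014", 410),
]

# Estimated tweets per period: n times the page size
estimates = {label: n * TWEETS_PER_PAGE for label, n in PERIODS}

total_n = sum(n for _, n in PERIODS)
total_tweets = sum(estimates.values())

for label, tweets in estimates.items():
    print(f"{label}: {tweets:,} tweets")
print(f"Sum of n: {total_n:,}")
print(f"Estimated total: {total_tweets:,} tweets")
```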

Now I need to create a spreadsheet with a unique URL for each page of 10 tweets, counting down from the maximum offset for each time period.

For example:

http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=170&mintime=1288612824&maxtime=1320148851

http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=160&mintime=1288612824&maxtime=1320148851

http://topsy.com/s?q=%40DataONEorg&type=tweet&sort=date&offset=150&mintime=1288612824&maxtime=1320148851

And so on.
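Rather than typing these by hand, the URL list could be generated and saved as a CSV file that opens in any spreadsheet program. The sketch below counts down from each period's maximum offset in steps of 10, following the example URLs above. The step is left as a parameter because whether the offset counts tweets or pages is not confirmed here. The names page_urls and write_url_sheet are hypothetical.

```python
import csv
from urllib.parse import urlencode

# (period label, maximum offset, mintime, maxtime) copied from the four links
PERIODS = [
    ("November 1, 2010 to November 1, 2011", 170, 1288612824, 1320148851),
    ("November 1, 2011 to July 29, 2012", 340, 1320148824, 1343563251),
    ("July 29, 2012 to July 29, 2013", 570, 1343563224, 1375099251),
    ("July 29, 2013 to February 6, 2014", 410, 1375099224, 1391688051),
]


def page_urls(max_offset, mintime, maxtime, step=10):
    """Yield one search URL per page, counting down from max_offset to 0."""
    for offset in range(max_offset, -1, -step):
        query = urlencode({
            "q": "@DataONEorg",  # encoded as %40DataONEorg
            "type": "tweet",
            "sort": "date",
            "offset": offset,
            "mintime": mintime,
            "maxtime": maxtime,
        })
        yield "http://topsy.com/s?" + query


def write_url_sheet(fh, periods=PERIODS, step=10):
    """Write one CSV row per URL to an open file; return the number of rows."""
    writer = csv.writer(fh)
    writer.writerow(["period", "url"])
    count = 0
    for label, max_offset, mintime, maxtime in periods:
        for url in page_urls(max_offset, mintime, maxtime, step):
            writer.writerow([label, url])
            count += 1
    return count
```

To produce the spreadsheet, open a file with open("topsy_urls.csv", "w", newline="") and pass it to write_url_sheet. The file name is only an example.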

The four estimates add up to about 14,900 tweets.

That means roughly 1,500 rows of unique URLs (1,490 by the counts above) spanning the four time periods.

I have three possibilities in mind for extracting this data.

1. Try the Linky Firefox add-on to collect 10 items 1,500 times (probably impractical).

I can view 100 pages at a time, so this would take only 15 iterations. Worth looking at.

2. Try the Xenu link-checking software to harvest links as if doing a link check.

3. Some other URL scraping tool.

I will investigate this further on a PC, as Xenu works on a PC.
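As one illustration of the third option, a short Python script could pull the links out of each saved or downloaded results page. This is a rough sketch, not a tested scraper: the layout of Topsy's results pages is not documented in this entry, so the rule used here (keep any link whose address contains "twitter.com/" and "/status/") is purely an assumption, and the names StatusLinkParser, extract_status_links and fetch are hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class StatusLinkParser(HTMLParser):
    """Collect href values that look like links to individual tweets."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: tweet links contain "twitter.com/" and "/status/"
        if href and "twitter.com/" in href and "/status/" in href:
            if href not in self.links:  # skip duplicates, keep page order
                self.links.append(href)


def extract_status_links(html):
    """Return the unique tweet-like links found in one page of HTML."""
    parser = StatusLinkParser()
    parser.feed(html)
    return parser.links


def fetch(url, timeout=30):
    """Download one results page and return its HTML as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Each URL from the spreadsheet would be passed to fetch, and the returned HTML to extract_status_links. The link rule would need adjusting after looking at a real results page.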

This PHP example did not work:

http://www.web-max.ca/PHP/misc_23.php

This may be worth looking at:

Scraping multiple Pages using the Scraper Extension and Refine: http://schoolofdata.org/handbook/recipes/scraping-multiple-pages-with-refine-and-scraper/

I’ll need to test these possibilities.

 

About Tanner Jessel

I am a graduate research assistant funded by DataONE and pursuing a Masters in Information Sciences with an Interdisciplinary Graduate Minor in Computational Science. I assist scholarly research efforts supporting the Sociocultural, Usability and Assessment, and Member Nodes working groups within DataONE. I am based at the Center for Information and Communication Studies at the University of Tennessee School of Information Science in Knoxville, Tennessee.

