{"id":1959,"date":"2014-02-04T05:09:12","date_gmt":"2014-02-04T05:09:12","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=1959"},"modified":"2014-02-04T15:45:25","modified_gmt":"2014-02-04T15:45:25","slug":"harvesting-dataoneorg-twitter-mentions-via-topsy","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/data-science\/harvesting-dataoneorg-twitter-mentions-via-topsy\/","title":{"rendered":"Harvesting @DataONEorg Twitter Mentions via Topsy"},"content":{"rendered":"

The previous notebook entry<\/a> concerned mentions of @DataONEorg on Twitter.<\/p>\n

I established the following:<\/p>\n

The oldest tweet returned by the Topsy search without a date range is from about 2 years ago.<\/p>\n

It is dated July 29, 2012.<\/p>\n

This tweet is accessible from here:<\/p>\n

http:\/\/topsy.com\/s?q=%40DataONEorg&window=a&type=tweet&sort=date&offset=990<\/a><\/p>\n

The very first re-tweet of @DataONEorg was March 15, 2011.<\/p>\n

This was about 4 months after @DataONEorg joined Twitter (November 18, 2010).<\/p>\n

The tweet is accessible via Topsy from this link:<\/p>\n

http:\/\/topsy.com\/s?q=%40DataONEorg&type=tweet&sort=date&offset=150&mintime=1288612824&maxtime=1320148851<\/a><\/p>\n

This is valid for the time period November 1, 2010 to November 1, 2011.<\/p>\n

I need the missing period between November 1, 2011 and July 29, 2012.<\/p>\n

I must generate a new search limited to that time period on Topsy.<\/p>\n

The link for the time period November 1, 2011 to July 29, 2012 is:<\/p>\n

http:\/\/topsy.com\/s?q=%40DataONEorg&type=tweet&sort=date&mintime=1320148824&maxtime=1343563251<\/a><\/p>\n
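The mintime and maxtime values in these links look like Unix epoch seconds. Decoding the ones above gives roughly noon UTC on the two boundary dates; that reading is inferred from the numbers themselves, not from any Topsy documentation. A minimal Python sketch of how such bounds could be computed, where the helper name `to_epoch` is hypothetical:

```python
from datetime import datetime, timezone


def to_epoch(year, month, day, hour=0, minute=0, second=0):
    """Return Unix epoch seconds for a UTC date and time."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp())


# November 1, 2011 to July 29, 2012, using the times of day
# that the values in the link above appear to encode (assumption: UTC).
mintime = to_epoch(2011, 11, 1, 12, 0, 24)
maxtime = to_epoch(2012, 7, 29, 12, 0, 51)
print(mintime, maxtime)
```

Any other date range can be produced the same way by changing the arguments.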

I now have four links for four time periods:<\/p>\n

    \n
  1. November 1, 2010 to November 1, 2011: http:\/\/topsy.com\/s?q=%40DataONEorg&type=tweet&sort=date&offset=170&mintime=1288612824&maxtime=1320148851<\/a><\/li>\n
  2. November 1, 2011 to July 29, 2012: http:\/\/topsy.com\/s?q=%40DataONEorg&type=tweet&sort=date&offset=340&mintime=1320148824&maxtime=1343563251<\/a><\/li>\n
  3. July 29, 2012 to July 29, 2013: http:\/\/topsy.com\/s?q=%40DataONEorg&type=tweet&sort=date&offset=570&mintime=1343563224&maxtime=1375099251<\/a><\/li>\n
  4. July 29, 2013 to February 6, 2014: http:\/\/topsy.com\/s?q=%40DataONEorg&type=tweet&sort=date&offset=410&mintime=1375099224&maxtime=1391688051<\/a><\/li>\n<\/ol>\n

    It is now possible to estimate the number of tweets in each period, assuming 10 tweets per page:<\/p>\n

      \n
    1. November 1, 2010 to November 1, 2011: n = 170, 1,700 tweets<\/li>\n
    2. November 1, 2011 to July 29, 2012: n = 340, 3,400 tweets<\/li>\n
    3. July 29, 2012 to July 29, 2013: n = 570, 5,700 tweets<\/li>\n
    4. July 29, 2013 to February 6, 2014: n = 410, 4,100 tweets<\/li>\n<\/ol>\n

      Now I need to create a spreadsheet with a unique URL for each page of 10 tweets, counting down from the highest offset for each time period.<\/p>\n

      For example:<\/p>\n

      http:\/\/topsy.com\/s?q=%40DataONEorg&type=tweet&sort=date&offset=170&mintime=1288612824&maxtime=1320148851<\/a><\/p>\n

      http:\/\/topsy.com\/s?q=%40DataONEorg&type=tweet&sort=date&offset=160&mintime=1288612824&maxtime=1320148851<\/a><\/p>\n

      http:\/\/topsy.com\/s?q=%40DataONEorg&type=tweet&sort=date&offset=150&mintime=1288612824&maxtime=1320148851<\/a><\/p>\n

      And so on.<\/p>\n
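The countdown above can be generated rather than typed by hand. The following is a sketch only: it follows the example URLs, which step the offset down by 10, and the `step` parameter, the period labels, and the function names are all hypothetical. If the offset turns out to count pages rather than tweets, `step=1` would be the value to try instead.

```python
import csv
import io
from urllib.parse import urlencode

BASE = "http://topsy.com/s"

# (label, highest offset, mintime, maxtime), copied from the four links above.
PERIODS = [
    ("2010-11-01 to 2011-11-01", 170, 1288612824, 1320148851),
    ("2011-11-01 to 2012-07-29", 340, 1320148824, 1343563251),
    ("2012-07-29 to 2013-07-29", 570, 1343563224, 1375099251),
    ("2013-07-29 to 2014-02-06", 410, 1375099224, 1391688051),
]


def page_urls(max_offset, mintime, maxtime, step=10):
    """Yield search URLs counting down from max_offset to 0 in steps of `step`."""
    for offset in range(max_offset, -1, -step):
        query = urlencode({
            "q": "@DataONEorg",  # urlencode turns "@" into "%40"
            "type": "tweet",
            "sort": "date",
            "offset": offset,
            "mintime": mintime,
            "maxtime": maxtime,
        })
        yield f"{BASE}?{query}"


def build_rows(periods=PERIODS, step=10):
    """Return (period label, URL) pairs for every page of every period."""
    rows = []
    for label, max_offset, mintime, maxtime in periods:
        for url in page_urls(max_offset, mintime, maxtime, step):
            rows.append((label, url))
    return rows


def rows_to_csv(rows):
    """Render the rows as CSV text that a spreadsheet can import."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["period", "url"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = build_rows()
print(rows_to_csv(rows[:3]))
```

Writing the result of `rows_to_csv(rows)` to a file gives one row per URL, ready to open as a spreadsheet.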

      By these estimates there are a total of 14,900 tweets.<\/p>\n

      That gives about 1,490 rows of unique URLs across the four time periods.<\/p>\n

      I have three possibilities in mind for extracting this data.<\/p>\n

      1. Try the Linky Firefox add-on to collect 10 items 1,500 times (probably impractical).<\/p>\n

      I can view 100 pages at a time, so it would take only 15 iterations. Worth looking at.<\/p>\n

      2. Try Xenu link checking software to harvest links as if doing a link check.<\/p>\n

      3. Some other URL scraping tool.<\/p>\n

      I will investigate this further on a PC, as Xenu works on a PC.<\/p>\n

      This PHP example did not work:<\/p>\n

      http:\/\/www.web-max.ca\/PHP\/misc_23.php<\/p>\n

      This may be worth looking at:<\/p>\n

      Scraping multiple Pages using the Scraper Extension and Refine: http:\/\/schoolofdata.org\/handbook\/recipes\/scraping-multiple-pages-with-refine-and-scraper\/<\/a><\/p>\n

      I’ll need to test these possibilities.<\/p>\n
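For the third possibility, a short script may be enough. The sketch below only shows the link-harvesting step, using Python's standard `html.parser`. The class and function names and the sample markup are hypothetical, and the real Topsy page structure has not been checked, so the filtering would need adjusting once an actual results page is saved and inspected.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all link targets found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    return parser.links


# Made-up markup standing in for one saved results page.
sample = (
    '<div><a href="http://twitter.com/example/status/1">tweet</a> '
    '<a name="anchor-only">no link</a></div>'
)
print(extract_links(sample))
```

Each saved results page could be passed through `extract_links`, keeping only the links that point at individual tweets.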

       <\/p>\n","protected":false},"excerpt":{"rendered":"

      The previous notebook entry concerned mentions of @DataONEorg on Twitter. I established the following: The oldest tweet is from 2 years ago. It is dated July 29, 2012. This tweet is accessible from here: http:\/\/topsy.com\/s?q=%40DataONEorg&window=a&type=tweet&sort=date&offset=990 The very first re-tweet of @DataONEorg was March 15, 2011. This was 5 months after Continue reading Harvesting @DataONEorg Twitter Mentions via Topsy<\/span>→<\/span><\/a><\/p>\n","protected":false},"author":35,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12],"tags":[233,23,140,215,227,192,232],"_links":{"self":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/1959"}],"collection":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/users\/35"}],"replies":[{"embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/comments?post=1959"}],"version-history":[{"count":2,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/1959\/revisions"}],"predecessor-version":[{"id":1961,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/1959\/revisions\/1961"}],"wp:attachment":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/media?parent=1959"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/categories?post=1959"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/tags?post=1959"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}