{"id":2040,"date":"2014-04-16T06:29:47","date_gmt":"2014-04-16T06:29:47","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=2040"},"modified":"2014-04-16T06:29:47","modified_gmt":"2014-04-16T06:29:47","slug":"web-scraping-with-python-libraries","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/data-science\/web-scraping-with-python-libraries\/","title":{"rendered":"Web Scraping with Python Libraries"},"content":{"rendered":"

The previous two notebook entries established some methods of text processing.<\/p>\n

The point is to create a file system populated with HTML-formatted text documents, so that the directory can be crawled and scraped.<\/p>\n

A common way to do this is with Python and Beautiful Soup<\/a>.<\/p>\n

Beautiful Soup<\/a>\u00a0is a Python library for pulling data out of HTML and XML files.<\/p>\n


There is a nice introduction to Beautiful Soup here: <http:\/\/www.pythonforbeginners.com\/beautifulsoup\/beautifulsoup-4-python<\/a>><\/p>\n

I’m particularly interested in sending the scraped data to a SQLite database (and eventually doing some network analysis).<\/p>\n
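For that SQLite step, a minimal sketch might look like the following — the `pages` table schema, the database file name, and the `save_page` helper are all assumptions of mine, not part of any established workflow:

```python
import sqlite3

# Hypothetical schema for illustration: one row per scraped page.
conn = sqlite3.connect("scraped.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS pages (
           url   TEXT PRIMARY KEY,
           title TEXT,
           body  TEXT
       )"""
)

def save_page(url, title, body):
    # INSERT OR REPLACE means re-scraping a URL updates its row
    # instead of violating the PRIMARY KEY constraint.
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
                 (url, title, body))
    conn.commit()

save_page("http://example.com/", "Example Domain", "Example body text")
```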

I can’t remember exactly where I first read about this workflow (if only I had posted it here!), but I think it had something to do with a Stack Overflow post.<\/p>\n

So I signed up for a free account on ScraperWiki<\/a>, which allows three free datasets at the “community” level of membership, along with the ability to code in the browser. More here: <https:\/\/blog.scraperwiki.com\/2012\/12\/a-small-matter-of-programming\/<\/a>>.<\/p>\n

ScraperWiki is a platform for doing data science on the web.<\/p>\n


From reviewing <http:\/\/www.pythonforbeginners.com\/python-on-the-web\/web-scraping-with-beautifulsoup\/<\/a>>, it looks like I need a few things:<\/p>\n

Pip<\/p>\n

https:\/\/raw.githubusercontent.com\/pypa\/pip\/master\/contrib\/get-pip.py<\/a><\/p>\n

I don’t really know why, but even though I already had pip installed, I could not use it. I thought maybe I should change Unix shells, and that worked – I switched to bash.<\/p>\n

Here’s the rest of what I typed in (I have saved the selected output from the terminal to upload here). The ellipses indicate where the shell did some things.<\/p>\n


\nbash-3.2$ sudo easy_install pip
\n...
\nbash-3.2$ pip install requests
\n...
\nbash-3.2$ pip install beautifulsoup4
\n...
\nbash-3.2$ pip install scraperwiki
\n<\/code><\/p>\n
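With requests and Beautiful Soup installed, combining them looks roughly like this — the `extract_links` helper is my own sketch, not from any of the tutorials above, and the live fetch only runs when the script is executed directly:

```python
from bs4 import BeautifulSoup

def extract_links(html):
    # Parse the markup and collect the href of every anchor that has one.
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a") if a.get("href")]

if __name__ == "__main__":
    import requests  # only needed for the live fetch below
    resp = requests.get("http://www.crummy.com/software/BeautifulSoup/")
    resp.raise_for_status()
    # Show the first few links on the Beautiful Soup homepage.
    print(extract_links(resp.text)[:5])
```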

So now I feel like I have everything I need to experiment with Web scraping. I think I could just format the text, but I would like to try scraping something from a live site, so I made this page <https:\/\/sites.google.com\/site\/mountainsol\/cv<\/a>> and will probably paste in something I generated yesterday (a PDF-to-Word-to-HTML version of my LinkedIn profile).<\/p>\n

I opened up the .htm file (the Word-to-HTML conversion of my LinkedIn PDF profile) and basically followed the tutorial here:<\/p>\n

http:\/\/www.crummy.com\/software\/BeautifulSoup\/bs4\/doc\/#quick-start<\/a><\/p>\n
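In the spirit of that quick start, here is how a small stand-in document parses (the markup below is invented for illustration – the real input would be the exported .htm profile):

```python
from bs4 import BeautifulSoup

# Invented stand-in markup, loosely shaped like a profile page.
html_doc = """
<html><head><title>Profile</title></head>
<body>
  <h1>Jane Example</h1>
  <p class="job">Data wrangler at <a href="http://example.org/">Example Org</a></p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
print(soup.title.string)                       # the <title> text: Profile
print(soup.h1.get_text())                      # first <h1>: Jane Example
print(soup.find("p", class_="job").a["href"])  # href attribute: http://example.org/
```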

It’s a lot of fun so far and I’m looking forward to seeing what I can extract.<\/p>\n

 <\/p>\n","protected":false},"excerpt":{"rendered":"

The previous two notebook entries established some methods of text processing. The point is to create a file system populated with HTML formatted text documents. This permits the directory to be crawled and scraped. A common way to do this is with Python and Beautiful Soup. Beautiful Soup\u00a0is a Python Continue reading Web Scraping with Python Libraries<\/span>→<\/span><\/a><\/p>\n","protected":false},"author":35,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12],"tags":[267,366,268,265,266,35,232],"_links":{"self":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/2040"}],"collection":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/users\/35"}],"replies":[{"embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/comments?post=2040"}],"version-history":[{"count":3,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/2040\/revisions"}],"predecessor-version":[{"id":2043,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/2040\/revisions\/2043"}],"wp:attachment":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/media?parent=2040"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/categories?post=2040"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/tags?post=2040"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}