{"id":2040,"date":"2014-04-16T06:29:47","date_gmt":"2014-04-16T06:29:47","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=2040"},"modified":"2014-04-16T06:29:47","modified_gmt":"2014-04-16T06:29:47","slug":"web-scraping-with-python-libraries","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/data-science\/web-scraping-with-python-libraries\/","title":{"rendered":"Web Scraping with Python Libraries"},"content":{"rendered":"

The previous two notebook entries established some methods of text processing.<\/p>\n

The point is to create a file system populated with HTML-formatted text documents, so that the directory can be crawled and scraped.<\/p>\n

A common way to do this is with Python and Beautiful Soup<\/a>.<\/p>\n

Beautiful Soup<\/a>\u00a0is a Python library for pulling data out of HTML and XML files.<\/p>\n


There is a nice introduction to Beautiful Soup here: <http:\/\/www.pythonforbeginners.com\/beautifulsoup\/beautifulsoup-4-python<\/a>><\/p>\n

I’m particularly interested in sending the scraped data to a SQLite database (and eventually doing some network analysis).<\/p>\n
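For that SQLite step, a minimal sketch might look like the following — the `pages` table schema, the database file name, and the `save_page` helper are all assumptions of mine, not part of any established workflow:

```python
import sqlite3

# Hypothetical schema for illustration: one row per scraped page.
conn = sqlite3.connect("scraped.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS pages (
           url   TEXT PRIMARY KEY,
           title TEXT,
           body  TEXT
       )"""
)

def save_page(url, title, body):
    # INSERT OR REPLACE means re-scraping a URL updates its row
    # instead of violating the PRIMARY KEY constraint.
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
                 (url, title, body))
    conn.commit()

save_page("http://example.com/", "Example Domain", "Example body text")
```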

I can’t remember exactly where I first read about this workflow (if only I had posted it here!), but I think it had something to do with a Stack Overflow post.<\/p>\n

So I signed up for a free account on ScraperWiki<\/a>, which allows three free datasets at the “community” level of membership, along with the ability to code in the browser. More here: <https:\/\/blog.scraperwiki.com\/2012\/12\/a-small-matter-of-programming\/<\/a>>.<\/p>\n

ScraperWiki is a platform for doing data science on the web.<\/p>\n


From reviewing <http:\/\/www.pythonforbeginners.com\/python-on-the-web\/web-scraping-with-beautifulsoup\/<\/a>>, it looks like I need a few things:<\/p>\n

Pip<\/p>\n

https:\/\/raw.githubusercontent.com\/pypa\/pip\/master\/contrib\/get-pip.py<\/a><\/p>\n

I don’t really know why, but even though I already had pip installed, I could not use it. I thought maybe I should change Unix shells, and that worked – I switched to bash.<\/p>\n

Here’s the rest of what I typed in (I have saved the selected output from the terminal to upload here). The ellipses indicate where the shell did some things.<\/p>\n


\nbash-3.2$ sudo easy_install pip
\n...
\nbash-3.2$ pip install requests
\n...
\nbash-3.2$ pip install beautifulsoup4
\n...
\nbash-3.2$ pip install scraperwiki
\n<\/code><\/p>\n
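With requests and Beautiful Soup installed, combining them looks roughly like this — the `extract_links` helper is my own sketch, not from any of the tutorials above, and the live fetch only runs when the script is executed directly:

```python
from bs4 import BeautifulSoup

def extract_links(html):
    # Parse the markup and collect the href of every anchor that has one.
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a") if a.get("href")]

if __name__ == "__main__":
    import requests  # only needed for the live fetch below
    resp = requests.get("http://www.crummy.com/software/BeautifulSoup/")
    resp.raise_for_status()
    # Show the first few links on the Beautiful Soup homepage.
    print(extract_links(resp.text)[:5])
```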

So now I feel like I have everything I need to experiment with Web scraping. I think I could just format the text, but I would like to try scraping something from a live site, so I made this page <https:\/\/sites.google.com\/site\/mountainsol\/cv<\/a>> and will probably paste in something I generated yesterday (a PDF-to-Word-to-HTML version of my LinkedIn profile).<\/p>\n

I opened up the .htm file (the Word-to-HTML conversion of my LinkedIn PDF profile) and basically followed the tutorial here:<\/p>\n

http:\/\/www.crummy.com\/software\/BeautifulSoup\/bs4\/doc\/#quick-start<\/a><\/p>\n
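In the spirit of that quick start, here is how a small stand-in document parses (the markup below is invented for illustration – the real input would be the exported .htm profile):

```python
from bs4 import BeautifulSoup

# Invented stand-in markup, loosely shaped like a profile page.
html_doc = """
<html><head><title>Profile</title></head>
<body>
  <h1>Jane Example</h1>
  <p class="job">Data wrangler at <a href="http://example.org/">Example Org</a></p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
print(soup.title.string)                       # the <title> text: Profile
print(soup.h1.get_text())                      # first <h1>: Jane Example
print(soup.find("p", class_="job").a["href"])  # href attribute: http://example.org/
```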

It’s a lot of fun so far and I’m looking forward to seeing what I can extract.<\/p>\n

 <\/p>\n","protected":false},"excerpt":{"rendered":"

The previous two notebook entries established some methods of text processing. The point is to create a file system populated with HTML formatted text documents. This permits the directory to be crawled and scraped. A common way to do this is with Python and Beautiful Soup. Beautiful Soup\u00a0is a Python Continue reading Web Scraping with Python Libraries<\/span>→<\/span><\/a><\/p>\n","protected":false},"author":35,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12],"tags":[267,366,268,265,266,35,232],"_links":{"self":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/2040"}],"collection":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/users\/35"}],"replies":[{"embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/comments?post=2040"}],"version-history":[{"count":3,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/2040\/revisions"}],"predecessor-version":[{"id":2043,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/posts\/2040\/revisions\/2043"}],"wp:attachment":[{"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/media?parent=2040"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/categories?post=2040"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/notebooks.dataone.org\/wp-json\/wp\/v2\/tags?post=2040"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}