{"id":2040,"date":"2014-04-16T06:29:47","date_gmt":"2014-04-16T06:29:47","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=2040"},"modified":"2014-04-16T06:29:47","modified_gmt":"2014-04-16T06:29:47","slug":"web-scraping-with-python-libraries","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/data-science\/web-scraping-with-python-libraries\/","title":{"rendered":"Web Scraping with Python Libraries"},"content":{"rendered":"
The previous two notebook entries established some methods of text processing.

The point is to create a file system populated with HTML-formatted text documents. This permits the directory to be crawled and scraped.
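Crawling that directory is the easy half. A minimal sketch, assuming the documents live under a local folder called `html_docs` (a placeholder name):

```python
# Sketch of the crawl step: walk a local directory tree and pick up
# every HTML document in it. "html_docs" is a placeholder path.
from pathlib import Path

for path in sorted(Path("html_docs").rglob("*.htm*")):
    text = path.read_text(encoding="utf-8")
    print(path, len(text))  # stand-in for the actual scraping step
```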
A common way to do this is with Python and Beautiful Soup.

> Beautiful Soup is a Python library for pulling data out of HTML and XML files.

There is a nice introduction to Beautiful Soup here: <http://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python>

I'm particularly interested in sending the scraped data to a SQLite database (and eventually doing some network analysis).

I can't remember exactly where I first read about this workflow (if only I had posted it here!), but I think it had something to do with a Stack Overflow post.

So I signed up for a free account on ScraperWiki, which allows three free datasets at the "community" level of membership, plus the ability to code in the browser. More here: <https://blog.scraperwiki.com/2012/12/a-small-matter-of-programming/>.

> ScraperWiki is a platform for doing data science on the web.

From reviewing <http://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup/>, it looks like I need a few things:

Pip: <https://raw.githubusercontent.com/pypa/pip/master/contrib/get-pip.py>

I don't really know why, but even though I already had pip installed I could not use it. I thought maybe I should change Unix shells, and that worked – I switched to bash.

Here's the rest of what I typed in (I have saved the selected output from the terminal below). The ellipses indicate where the shell did some things.
```bash
bash-3.2$ sudo easy_install pip
...
bash-3.2$ pip install requests
...
bash-3.2$ pip install BeautifulSoup
...
bash-3.2$ pip install scraperwiki
```
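(Worth flagging for anyone following along: the `BeautifulSoup` package on PyPI is the legacy Beautiful Soup 3; the bs4 quick-start linked below expects `pip install beautifulsoup4` instead.)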
\n<\/code><\/p>\n