The previous two notebook entries established some methods of text processing.
The goal is to create a file system populated with HTML-formatted text documents, so that the directory can be crawled and scraped.
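To make that concrete, here is a minimal sketch of the crawling step: walk a folder of HTML files and collect their paths. The folder name below is a placeholder of my own, not a real one.

import os

# Walk a (hypothetical) folder of HTML documents and collect their paths.
html_files = []
for root, dirs, files in os.walk("html_documents"):
    for name in files:
        if name.endswith((".htm", ".html")):
            html_files.append(os.path.join(root, name))

print(len(html_files), "HTML files found")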
A common way to do this is with Python and Beautiful Soup.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
There is a nice introduction to Beautiful Soup here: <http://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python>
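To give a flavor of what it does, here is a tiny example, assuming Beautiful Soup 4 (the bs4 package):

from bs4 import BeautifulSoup

# Parse a scrap of markup and pull things back out of it.
html = "<html><body><h1>My CV</h1><a href='http://example.com'>a link</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.string)    # My CV
print(soup.a["href"])    # http://example.com
print(soup.get_text())   # all the text, tags stripped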
I’m particularly interested in sending the scraped data to a SQLite database (and eventually doing some network analysis).
I can’t remember exactly where I first read about this workflow (if only I had posted it here!), but I think it had something to do with a Stack Overflow post.
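Wherever it came from, the workflow itself is easy to sketch with requests, Beautiful Soup, and the standard-library sqlite3 module. The URL and table layout below are placeholders of my own, not a fixed recipe:

import sqlite3
import requests
from bs4 import BeautifulSoup

# Fetch a page (placeholder URL), parse it, and save its links to SQLite.
response = requests.get("http://example.com")
soup = BeautifulSoup(response.text, "html.parser")

conn = sqlite3.connect("scraped.db")
conn.execute("CREATE TABLE IF NOT EXISTS links (text TEXT, href TEXT)")
for a in soup.find_all("a"):
    conn.execute("INSERT INTO links VALUES (?, ?)", (a.get_text(), a.get("href")))
conn.commit()
conn.close()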
So I signed up for a free account on ScraperWiki, which allows three free datasets at the “community” membership level, along with the ability to code in the browser. More here: <https://blog.scraperwiki.com/2012/12/a-small-matter-of-programming/>.
ScraperWiki is a platform for doing data science on the web.
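As I understand it, scraping code on the platform looks something like the sketch below. I’m assuming the scrape and sqlite.save helpers that the ScraperWiki documentation describes, and the URL is again a placeholder:

import scraperwiki
from bs4 import BeautifulSoup

# Fetch a page and save one row to the platform's SQLite store.
html = scraperwiki.scrape("http://example.com")   # placeholder URL
soup = BeautifulSoup(html, "html.parser")
scraperwiki.sqlite.save(unique_keys=["url"],
                        data={"url": "http://example.com", "title": soup.title.string})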
From reviewing <http://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup/>, it looks like I need a few things:
Pip, via the get-pip.py bootstrap script: <https://raw.githubusercontent.com/pypa/pip/master/contrib/get-pip.py>
I don’t really know why, but even though I already had pip installed I could not use it. I thought maybe I should change Unix shells, and that worked: I switched to bash.
Here’s the rest of what I typed in (I have saved the selected output from the terminal to upload here). The ellipses indicate where the shell did some things.
bash-3.2$ sudo easy_install pip
...
bash-3.2$ pip install requests
...
bash-3.2$ pip install beautifulsoup4
...
bash-3.2$ pip install scraperwiki
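A quick sanity check in the Python interpreter confirms everything imports cleanly:

# Confirm the freshly installed packages are importable.
import requests
import scraperwiki
from bs4 import BeautifulSoup

print(requests.__version__)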
So now I feel like I have what I need to experiment with web scraping. I think I can just format the text, but I would like to try doing something from a site, so I made this page <https://sites.google.com/site/mountainsol/cv> and will probably paste in something I generated yesterday (a PDF-to-Word-to-HTML version of my LinkedIn profile).
I opened up the .htm version of my Word-to-HTML conversion of my LinkedIn PDF profile and basically followed the quick-start tutorial here: <http://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start>
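Here is roughly what that looks like against a local file; the filename below stands in for my actual export:

from bs4 import BeautifulSoup

# Open the local .htm export and poke around, quick-start style.
with open("linkedin_profile.htm") as f:   # hypothetical filename
    soup = BeautifulSoup(f.read(), "html.parser")

print(soup.title)               # the document's <title>
print(soup.find_all("a")[:5])   # the first few links
print(soup.get_text()[:200])    # leading text with the markup stripped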
It’s a lot of fun so far and I’m looking forward to seeing what I can extract.