Web Scraping with Python Libraries

The previous two notebook entries established some methods of text processing.

The goal is to create a directory tree populated with HTML-formatted text documents, so that the directory can be crawled and scraped.

A common way to do this is with Python and Beautiful Soup.

Beautiful Soup is a Python library for pulling data out of HTML and XML files.
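As a rough sketch of what crawling such a directory could look like, the snippet below walks a folder, parses every .html or .htm file with Beautiful Soup, and collects each document's title. The function name `scrape_titles` and the folder name `html_docs` are placeholders I made up for illustration, and it assumes the Beautiful Soup 4 package (imported as `bs4`) is installed.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def scrape_titles(root):
    """Return {relative path: <title> text or None} for each HTML file under root."""
    titles = {}
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and anything that is not an .html/.htm file.
        if not path.is_file() or path.suffix.lower() not in {".html", ".htm"}:
            continue
        # Pass raw bytes so Beautiful Soup can work out the encoding itself.
        soup = BeautifulSoup(path.read_bytes(), "html.parser")
        title = soup.title.get_text(strip=True) if soup.title else None
        titles[path.relative_to(root).as_posix()] = title
    return titles


if __name__ == "__main__":
    # "html_docs" is a hypothetical folder name, not one from this post.
    for name, title in scrape_titles("html_docs").items():
        print(name, "->", title)
```

The built-in `html.parser` backend is used here only because it needs no extra install; other parsers such as lxml should work the same way for this purpose.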

 

There is a nice introduction to Beautiful Soup here: <http://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python>

I’m particularly interested in sending the scraped data to a SQLite database (and eventually doing some network analysis).
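Here is a minimal sketch of the SQLite side, using only the `sqlite3` module from Python's standard library. The table name `pages`, its three columns, and the function name `save_records` are assumptions of mine for illustration; the real schema would depend on what ends up being scraped.

```python
import sqlite3


def save_records(db_path, records):
    """Store (source, title, body) tuples in a `pages` table, creating it if needed."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back if an error is raised
            conn.execute(
                "CREATE TABLE IF NOT EXISTS pages ("
                "source TEXT PRIMARY KEY, title TEXT, body TEXT)"
            )
            # Parameter placeholders (?) keep scraped text from being
            # interpreted as SQL.
            conn.executemany(
                "INSERT OR REPLACE INTO pages (source, title, body) "
                "VALUES (?, ?, ?)",
                records,
            )
    finally:
        conn.close()
```

Something like `save_records("scraped.db", [("cv.html", "CV", "some text")])` would write one row. Because `source` is the primary key and the insert uses `INSERT OR REPLACE`, scraping the same page again overwrites its row instead of adding a duplicate.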

I can’t remember exactly where I first read about this workflow (if only I had posted it here!), but I think it had something to do with a Stack Overflow post.

So I signed up for a free account on ScraperWiki, which allows three free datasets at the “community” level of membership and also lets you code in the browser. More here: <https://blog.scraperwiki.com/2012/12/a-small-matter-of-programming/>.

ScraperWiki is a platform for doing data science on the web.

 

From reviewing <http://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup/>, it looks like I need a few things:

Pip

https://raw.githubusercontent.com/pypa/pip/master/contrib/get-pip.py

I don’t really know why, but I already had pip installed and could not use it. I thought maybe I should change Unix shells, and that worked: once I switched to bash, pip ran.

Here’s the rest of what I typed in (I have saved selected output from the terminal to upload here). The ellipses mark where I have left out what the shell printed.


bash-3.2$ sudo easy_install pip
...
bash-3.2$ pip install requests
...
bash-3.2$ pip install BeautifulSoup
...
bash-3.2$ pip install scraperwiki
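To check that the installs actually landed in the Python being used, a small sketch like the one below can report which modules are missing. The helper name `missing_modules` is mine. One thing worth knowing: on PyPI the package named `BeautifulSoup` is the older 3.x series, while the Beautiful Soup 4 documentation linked in this post covers the package published as `beautifulsoup4` and imported as `bs4`, so both import names are checked here.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found by this interpreter."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    wanted = ["requests", "bs4", "BeautifulSoup", "scraperwiki"]
    print("Missing:", missing_modules(wanted))
```

If `bs4` shows up as missing, `pip install beautifulsoup4` is the usual way to get it.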

So now I feel like I have the things I need to experiment with web scraping. I think I could just format the text myself, but I would like to try scraping something from a site, so I made this page <https://sites.google.com/site/mountainsol/cv> and will probably paste in something that I generated yesterday (a PDF-to-Word-to-HTML version of my LinkedIn profile).

I opened up the .htm file (the Word-to-HTML version of my LinkedIn PDF profile) and basically followed the tutorial here:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start

It’s a lot of fun so far and I’m looking forward to seeing what I can extract.
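For anyone following along, this is roughly the kind of thing the quick start leads to when pointed at a local file. It is a sketch, not exactly what I typed: the function name `summarize_html` is made up, and it assumes Beautiful Soup 4 is installed and importable as `bs4`.

```python
from bs4 import BeautifulSoup


def summarize_html(path):
    """Parse one local HTML file and return its title, link targets, and visible text."""
    # Open in binary mode: Word-exported HTML is often not UTF-8, and
    # Beautiful Soup can detect the encoding when it is given raw bytes.
    with open(path, "rb") as handle:
        soup = BeautifulSoup(handle, "html.parser")
    return {
        "title": soup.title.get_text(strip=True) if soup.title else None,
        # href=True skips <a> tags that are only named anchors.
        "links": [a["href"] for a in soup.find_all("a", href=True)],
        "text": soup.get_text(" ", strip=True),
    }
```

The returned dictionary is just a convenient shape for printing or for loading into a database later; `find_all` and `get_text` are the same calls the quick start demonstrates.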

 

About Tanner Jessel

I am a graduate research assistant funded by DataONE and pursuing a Masters in Information Sciences with an Interdisciplinary Graduate Minor in Computational Science. I assist scholarly research efforts supporting the Sociocultural, Usability and Assessment, and Member Nodes working groups within DataONE. I am based at the Center for Information and Communication Studies at the University of Tennessee School of Information Science in Knoxville, Tennessee.
