The previous two notebook entries established some methods of text processing.
The goal is to create a file system populated with HTML-formatted text documents, so that the directory can be crawled and scraped.
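To make that concrete, here is a minimal sketch of the crawling step: walk a folder of HTML files and collect their paths. The folder name below is a placeholder of my own, not a real one.

import os

# Walk a (hypothetical) folder of HTML documents and collect their paths.
html_files = []
for root, dirs, files in os.walk("html_documents"):
    for name in files:
        if name.endswith((".htm", ".html")):
            html_files.append(os.path.join(root, name))

print(len(html_files), "HTML files found")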
A common way to do this is with Python and Beautiful Soup.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
There is a nice introduction to Beautiful Soup here: <http://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python>
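To give a flavor of what it does, here is a tiny example, assuming Beautiful Soup 4 (the bs4 package):

from bs4 import BeautifulSoup

# Parse a scrap of markup and pull things back out of it.
html = "<html><body><h1>My CV</h1><a href='http://example.com'>a link</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.string)    # My CV
print(soup.a["href"])    # http://example.com
print(soup.get_text())   # all the text, tags stripped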
I’m particularly interested in sending the scraped data to a SQLite database (and eventually doing some network analysis).
I can’t remember exactly where I first read about this workflow (if only I had posted it here!), but I think it had something to do with a Stack Overflow post.
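Wherever it came from, the workflow itself is easy to sketch with requests, Beautiful Soup, and the standard-library sqlite3 module. The URL and table layout below are placeholders of my own, not a fixed recipe:

import sqlite3
import requests
from bs4 import BeautifulSoup

# Fetch a page (placeholder URL), parse it, and save its links to SQLite.
response = requests.get("http://example.com")
soup = BeautifulSoup(response.text, "html.parser")

conn = sqlite3.connect("scraped.db")
conn.execute("CREATE TABLE IF NOT EXISTS links (text TEXT, href TEXT)")
for a in soup.find_all("a"):
    conn.execute("INSERT INTO links VALUES (?, ?)", (a.get_text(), a.get("href")))
conn.commit()
conn.close()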
So I signed up for a free account on ScraperWiki, which allows three free datasets at the “community” membership level, along with the ability to code in the browser. More here: <https://blog.scraperwiki.com/2012/12/a-small-matter-of-programming/>.
ScraperWiki is a platform for doing data science on the web.
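As I understand it, scraping code on the platform looks something like the sketch below. I’m assuming the scrape and sqlite.save helpers that the ScraperWiki documentation describes, and the URL is again a placeholder:

import scraperwiki
from bs4 import BeautifulSoup

# Fetch a page and save one row to the platform's SQLite store.
html = scraperwiki.scrape("http://example.com")   # placeholder URL
soup = BeautifulSoup(html, "html.parser")
scraperwiki.sqlite.save(unique_keys=["url"],
                        data={"url": "http://example.com", "title": soup.title.string})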
From reviewing <http://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup/>, it looks like I need a few things:
Pip, via the get-pip.py bootstrap script: <https://raw.githubusercontent.com/pypa/pip/master/contrib/get-pip.py>
I don’t really know why, but even though I already had pip installed I could not use it. I thought maybe I should change Unix shells, and that worked: I switched to bash.
Here’s the rest of what I typed in (I have saved the selected output from the terminal to upload here). The ellipses indicate where the shell did some things.
bash-3.2$ sudo easy_install pip
...
bash-3.2$ pip install requests
...
bash-3.2$ pip install beautifulsoup4
...
bash-3.2$ pip install scraperwiki
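A quick sanity check in the Python interpreter confirms everything imports cleanly:

# Confirm the freshly installed packages are importable.
import requests
import scraperwiki
from bs4 import BeautifulSoup

print(requests.__version__)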
So now I feel like I have what I need to experiment with web scraping. I think I can just format the text, but I would like to try doing something from a site, so I made this page <https://sites.google.com/site/mountainsol/cv> and will probably paste in something I generated yesterday (a PDF-to-Word-to-HTML version of my LinkedIn profile).
I opened up the .htm version of my Word-to-HTML conversion of my LinkedIn PDF profile and basically followed the quick-start tutorial here: <http://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start>
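Here is roughly what that looks like against a local file; the filename below stands in for my actual export:

from bs4 import BeautifulSoup

# Open the local .htm export and poke around, quick-start style.
with open("linkedin_profile.htm") as f:   # hypothetical filename
    soup = BeautifulSoup(f.read(), "html.parser")

print(soup.title)               # the document's <title>
print(soup.find_all("a")[:5])   # the first few links
print(soup.get_text()[:200])    # leading text with the markup stripped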
It’s a lot of fun so far and I’m looking forward to seeing what I can extract.