Web Scraping with Python Libraries

The previous two notebook entries established some methods of text processing. The goal is to create a directory tree populated with HTML-formatted text documents, so that the directory can be crawled and scraped. A common way to do this is with Python and Beautiful Soup. Beautiful Soup is a Python Continue reading
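As a rough illustration of the crawl-and-scrape idea described in this excerpt, here is a minimal sketch using Beautiful Soup. It assumes the third-party `beautifulsoup4` package is installed and that the converted documents end in `.html`. The function name `scrape_directory` and the choice to collect only each page's title and visible text are hypothetical, not taken from the original post.

```python
from pathlib import Path

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def scrape_directory(root):
    """Walk `root` recursively and return {relative path: (title, text)}.

    Hypothetical helper: the original post does not say which fields
    were extracted, so title and visible text are assumptions.
    """
    root = Path(root)
    results = {}
    for path in sorted(root.rglob("*.html")):
        # Tolerate odd encodings produced by PDF-to-HTML converters.
        html = path.read_text(encoding="utf-8", errors="replace")
        soup = BeautifulSoup(html, "html.parser")
        title = soup.title.get_text(strip=True) if soup.title else ""
        # Join all text nodes with single spaces.
        text = soup.get_text(" ", strip=True)
        results[path.relative_to(root).as_posix()] = (title, text)
    return results
```

Calling `scrape_directory("converted_docs")` (a placeholder folder name) would return one entry per HTML file, keyed by its path relative to that folder.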

Text Processing Methods, Continued (PDF to HTML Conversion)

I am continuing the evaluation of some text processing tools that I began in an earlier open notebook post on the same topic. I also had an idea that perhaps I should open my PDF documents in Word, then re-save them as HTML. That workflow might standardize the formatting to something less Continue reading

Text Processing Methods for Data Extraction (PDF to HTML conversion)

Tried for Mac: VeryPDF “PDF to Any Converter”. Did not like it. PDF to HTML was not good. PDF to Excel was OK, but one complaint is that the documents are placed into a new folder. Might be useful: http://sourceforge.net/projects/pdftohtml/. From the first freeware/trialware software I tried, I’m definitely dog-earing Continue reading

Data Management for Research Output with OpenWetWare Wiki

Although I want to be a professional data manager and have extensive training in data management, in practice I have realized it’s pretty tough to do, even for a small data analysis project like the Figshare users’ survey. I did data analysis for that on another computer, I was in Continue reading

The Long and Winding Road to Public Data

CC-BY-NC-SA by DJOtaku via flickr

Dr. Watson was accustomed to seeing dead things. As a wildlife ecologist, he had made a career out of investigating animals and their untimely demise under the rumbling engines of motor vehicles. Animal road mortalities had a reputation for being difficult to track because the majority of incidents went unreported, Continue reading

Early adopters of open research output: a study of the motivations and opinions of Figshare.com users (Poster)

This 3.8 MB poster (which actually exceeds the file size that may be uploaded onto this WordPress open research notebook) was presented at The University of Tennessee’s College of Communication and Information Research Symposium. It is a first public look at some research effort that has been discussed in other Continue reading

Consolidating Year 1 – Year 4 @DataONEorg Tweets

I am continuing quality control efforts today. From looking at checksums for the files, some of the 147 appear to be the same. This concerns me due to the possibility of human error (my error) in creating the files, since I scraped tweets manually with a browser extension, rather than Continue reading
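The duplicate check described in this excerpt can be sketched with Python's standard library. This is only an illustration under stated assumptions: the post does not name the checksum tool or algorithm, so SHA-256 is assumed here, and the helper names `sha256_of` and `find_duplicates` and the `*.txt` pattern are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(folder, pattern="*.txt"):
    """Group files in `folder` whose contents hash to the same digest.

    Returns a list of lists of file names; only groups with more than
    one member (that is, byte-identical files) are included.
    """
    groups = defaultdict(list)
    for path in sorted(Path(folder).glob(pattern)):
        groups[sha256_of(path)].append(path.name)
    return [names for names in groups.values() if len(names) > 1]
```

Files that share a digest are byte-for-byte identical with overwhelming probability, so any group returned here would be a candidate for a scraping batch that was saved twice.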

Continue Scraping, Introduce Quality Control with Hashes

Continuation and completion of harvesting with quality control / assurance exploration using hashes and checksum software. Start 97 – 77. 97 contains year 3 and offset 450. Start at 12:05. Save text file Topsy-97-77. End at 12:21. New file Topsy-76-56. 56 ends at Y3040. Expand to Continue reading

Scraping @DataONEorg Tweets Off the Web with Browser Extensions

An earlier method I tried was unable to harvest tweets mentioning @DataONEorg using the Google Chrome browser extension “Scraper”. Scraper is a simple data mining extension for Google Chrome™ that is useful for online research when you need to quickly analyze data in spreadsheet form. Reviewing some of the software Continue reading