Web Scraping with Python Libraries

The previous two notebook entries established some methods of text processing. The point is to create a file system populated with HTML-formatted text documents, so that the directory can be crawled and scraped. A common way to do this is with Python and Beautiful Soup. Beautiful Soup is a Python …
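As a rough illustration of crawling a directory of HTML files and pulling out their text, here is a minimal sketch. It is not the method used in the post: to stay free of third-party dependencies it swaps in the standard library's html.parser module for Beautiful Soup, and the names TextExtractor and scrape_directory are hypothetical. With Beautiful Soup installed, BeautifulSoup(html, "html.parser").get_text() would play a similar role to the parser class here.

```python
import os
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the <title> and visible text of one HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def scrape_directory(root):
    """Walk root and return {path: (title, text)} for each .html/.htm file."""
    results = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith((".html", ".htm")):
                continue  # ignore anything that is not an HTML document
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                parser = TextExtractor()
                parser.feed(handle.read())
                parser.close()
            results[path] = (parser.title.strip(), " ".join(parser.chunks))
    return results
```

Calling scrape_directory on the folder of converted documents returns a dictionary keyed by file path, which can then be written out or filtered for the fields of interest.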

Text Processing Methods, Continued (PDF to HTML Conversion)

I am continuing my evaluation of some text processing tools that I began in an earlier open notebook post on the same topic. I also had an idea that perhaps I should open my PDF documents in Word, then re-save them as HTML. That workflow might standardize the formatting to something less …

Text Processing Methods for Data Extraction (PDF to HTML conversion)

Tried for Mac: VeryPDF “PDF to Any Converter”. Did not like it. PDF to HTML was not good. PDF to Excel was OK, but one complaint is that the documents are placed into a new folder. Might be useful: http://sourceforge.net/projects/pdftohtml/. From the first freeware/trialware software I tried, I’m definitely dog-earing …

Data Management for Research Output with OpenWetWare Wiki

Although I want to be a professional data manager and have extensive training in data management, in practice I have realized it’s pretty tough to do, even for a small data analysis project like the Figshare users’ survey. I did data analysis for that on another computer, I was in …

Tallying Every Bug and Byte

Nora was a PhD student when she attended a meeting that would change the course of her career. Until that point, Nora had thought of herself chiefly as an entomologist, with her primary work objective being (as she joked with her colleagues) counting bugs. Born and raised in the Midwest, …

The Long and Winding Road to Public Data

Dr. Watson was accustomed to seeing dead things. As a wildlife ecologist, he had made a career out of investigating animals and their untimely demise under the rumbling engines of motor vehicles. Animal road mortalities had a reputation for being difficult to track because the majority of incidents went unreported, …

Early adopters of open research output: a study of the motivations and opinions of Figshare.com users (Poster)

This 3.8 MB poster (which actually exceeds the file size that may be uploaded onto this WordPress open research notebook) was presented at The University of Tennessee’s College of Communication and Information Research Symposium. It is a first public look at some research effort that has been discussed in other …

Consolidating Year 1 – Year 4 @DataONEorg Tweets

I am continuing quality control efforts today. From looking at checksums for the files, some of the 147 appear to be the same. This concerns me due to the possibility of human error (my error) in creating the files, since I scraped tweets manually with a browser extension, rather than …
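The checksum comparison described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post does not say which checksum tool or algorithm was used, so SHA-256 from the standard hashlib module is an assumption, and the function names file_sha256 and find_duplicates are hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_duplicates(folder):
    """Group the files in folder by checksum; return groups with 2+ names."""
    by_hash = defaultdict(list)
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):  # skip subdirectories
            by_hash[file_sha256(path)].append(name)
    return [names for names in by_hash.values() if len(names) > 1]
```

Each returned group lists file names whose contents are byte-for-byte identical, which narrows down where a manual scraping step may have saved the same page twice.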

Continue Scraping, Introduce Quality Control with Hashes

Continuation and completion of harvesting with quality control / assurance exploration using hashes and checksum software. Start 97 – 77; 97 contains year 3 and offset 450. Start at 12:05. Save text file Topsy-97-77. End at 12:21. New file Topsy-76-56; 56 ends at Y3040. Expand to …