Text Processing Methods for Data Extraction (PDF to HTML conversion)

Tried on a Mac:

VeryPDF “PDF to Any Converter”: I did not like it.  PDF to HTML was not good.  PDF to Excel was OK, but one complaint is that the converted documents are placed into a new folder.

Might be useful: http://sourceforge.net/projects/pdftohtml/.

After the first freeware/trialware software I tried, I’m definitely dog-earing the open source option.

It is a text-processing application written in C++ (?) that converts PDF to HTML or XML.  It is kind of frustrating that there is no documentation, so I poked around for some other options.  There is some help here: http://sourceforge.net/p/pdftohtml/discussion/150221

I looked for “compiling on a mac” and found this:

UPDATE:  brew install pdftohtml

“brew” is Homebrew, a package manager for the Mac, and it makes things pretty easy.

Last login: Wed Apr  9 16:07:20 on console
tjessel:~ apple$ brew install pdftohtml
==> Downloading https://downloads.sourceforge.net/project/pdftohtml/Experimental
######################################################################## 100.0%
==> make
/usr/local/Cellar/pdftohtml/0.40a: 4 files, 900K, built in 28 seconds
tjessel:~ apple$

Now… how to run it.

Apparently you just type in “pdftohtml” in the terminal.


tjessel:~ apple$ pdftohtml
pdftohtml version 0.40 http://pdftohtml.sourceforge.net/, based on Xpdf version 3.01
Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
Copyright 1996-2005 Glyph & Cog, LLC
Usage: pdftohtml [options] <PDF-file> [<html-file> <xml-file>]
-f : first page to convert
-l : last page to convert
-q : don't print any messages or errors
-h : print usage information
-help : print usage information
-p : exchange .pdf links by .html
-c : generate complex document
-i : ignore images
-noframes : generate no frames
-stdout : use standard output
-zoom : zoom the pdf document (default 1.5)
-xml : output for XML post-processing
-hidden : output hidden text
-enc : output text encoding name
-dev : output device name for Ghostscript (png16m, jpeg etc)
-v : print copyright and version info
-opw : owner password (for encrypted files)
-upw : user password (for encrypted files)
tjessel:~ apple$

So I think I need to navigate to the folder the PDFs are in?
Currently the PDFs I am interested in are on an external flash drive.  I’m moving them to “Documents” on my HDD.

So let’s see I have this folder: linkedin profiles-WG-Members

Tried this, doesn’t work.

tjessel:linkedin profiles-WG-Members apple$ pdftohtml *.pdf *.html

This apparently works:

tjessel:linkedin profiles-WG-Members apple$ pdftohtml TannerJessel.pdf TannerJessel.html

This generated three new files in the same directory as “TannerJessel.pdf.”

1) TannerJessel_ind.html (this is an index).

2) TannerJessel.html (this uses frames and combines both TannerJessels.html and TannerJessel_ind.html into one page).

3) TannerJessels.html (just the converted document; I wonder if the “s” is for “single”?).
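The wildcard attempt above presumably fails because pdftohtml appears to take one PDF per call, while the shell expands *.pdf into many file names. A minimal sketch of a workaround, assuming pdftohtml is on the PATH, is to loop over the folder and run it once per file. The function names here are hypothetical; the command itself is the one that worked above.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path):
    """Return the pdftohtml argument list for a single PDF.

    Mirrors the working call above:
    pdftohtml TannerJessel.pdf TannerJessel.html
    """
    pdf_path = Path(pdf_path)
    html_path = pdf_path.with_suffix(".html")
    return ["pdftohtml", str(pdf_path), str(html_path)]


def convert_folder(folder):
    """Run pdftohtml once for every PDF in `folder`.

    Assumes the pdftohtml executable can be found on the PATH.
    """
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        subprocess.run(build_command(pdf_path), check=True)


# Example call (not run here):
# convert_folder("linkedin profiles-WG-Members")
```

Passing the arguments as a list also sidesteps the space in the folder name, which would otherwise need quoting in the shell.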

I was going to paste the output of TannerJessels.html below to inspect, but on second thought I’m not: posting HTML on WordPress with the <code> tag is more trouble than it’s worth, so I am posting a .txt file instead, TannerJessel.txt. (Note: cheat sheet for code in WordPress here.) I’m unhappy because I was hoping for the headings to be preserved.

I opened it up and am not very excited about the results.

From the PDF version of my LinkedIn profile, what fields am I interested in? Let’s open it up: TannerJessel.pdf.

Format seems to be this:

  • Name
  • Title
  • E-mail address
  • Summary
  • Specialties
  • Experience
  • Publications
  • Education
  • Honors and Awards
  • Interests

LinkedIn apparently stores a variety of fields, and I think the fields vary based on when the profile was created or last edited.  So far, these are the fields that I have identified and am tracking in a spreadsheet entitled “LinkedIN-fields.xlsx.”

  • Summary
  • Specialties
  • Experience
  • Skills & Expertise
  • Education
  • Languages
  • Publications
  • Honors and Awards
  • Interests
  • Volunteer Experience

“Skills & Expertise” is an interesting one: I don’t have it, but a lot of DataONE working group members do.

So let’s look at the way these headings are outlined. I wanted the formatting preserved, and in some ways it is with “ft6” or “ft2” for various font sizes.

However, there is no difference between the “heading” and the “text” for something within the “Experience” section.  I could possibly program something to match “Experience<br>” to find the “Experience” heading, but this is problematic.  Let me try .xml instead of .html.

[code language="bash"]

tjessel:linkedin profiles-WG-Members apple$ pdftohtml TannerJessel.pdf TannerJessel.xml

[/code]

Well, that did not work. I just got a file named TannerJessel.xml.html.

Tried this:

tjessel:linkedin profiles-WG-Members apple$ pdftohtml -xml TannerJessel.pdf TannerJessel

This creates an XML document with a lot more markup.

Here’s my name:

<text top="57" left="54" width="176" height="27" font="9">Tanner Jessel</text>

That appears twice, so I think that’s useful for parsing out the name of the person I’m interested in.

So this is good, the headings appear to have some consistent formatting:

<text top="179" left="54" width="93" height="22" font="3">Summary</text>

<text top="504" left="54" width="104" height="22" font="3">Specialties</text>

<text top="647" left="54" width="108" height="22" font="3">Experience</text>
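If the headings really do share one font id, pulling them out could look something like this. It is only a sketch using Python’s standard xml.etree.ElementTree; the sample is built from the <text> elements quoted above, and the enclosing <page> element is an assumption about the rest of the file.

```python
import xml.etree.ElementTree as ET

# Sample built from the <text> elements quoted above; the enclosing
# <page> element is an assumption about the rest of the file.
SAMPLE = """<page>
<text top="57" left="54" width="176" height="27" font="9">Tanner Jessel</text>
<text top="179" left="54" width="93" height="22" font="3">Summary</text>
<text top="504" left="54" width="104" height="22" font="3">Specialties</text>
<text top="647" left="54" width="108" height="22" font="3">Experience</text>
</page>"""


def texts_with_font(xml_string, font_id):
    """Return the text of every <text> element whose font attribute matches."""
    root = ET.fromstring(xml_string)
    return [
        "".join(element.itertext()).strip()
        for element in root.iter("text")
        if element.get("font") == font_id
    ]


print(texts_with_font(SAMPLE, "3"))
# prints ['Summary', 'Specialties', 'Experience']
```

Using itertext() rather than .text means any nested markup inside a <text> element (bold or italic tags, for instance) would not cut the heading short.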

This part is a bit problematic – the longer line is wrapped.

<text top="682" left="54" width="747" height="16" font="158">Graduate Research Assistant, Data Observation Network for Earth (DataONE) at University of</text>
<text top="701" left="54" width="78" height="16" font="5">Tennessee</text>

Note that “University of Tennessee” is split across the two elements, and the split is inconsistent: the heights are the same (16), but the fonts are not (158 versus 5). That might mess things up a bit when trying to parse out a position for an individual.
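One possible way around the wrapping is to ignore the font attribute and join a line onto the previous one when the geometry suggests a wrap. This is a sketch of a heuristic I am assuming, not something pdftohtml provides: same left edge, same height, and a small vertical step. The <page> wrapper and the max_gap value are also assumptions.

```python
import xml.etree.ElementTree as ET

# Sample built from the elements quoted above; <page> is an assumed wrapper.
SAMPLE = """<page>
<text top="647" left="54" width="108" height="22" font="3">Experience</text>
<text top="682" left="54" width="747" height="16" font="158">Graduate Research Assistant, Data Observation Network for Earth (DataONE) at University of</text>
<text top="701" left="54" width="78" height="16" font="5">Tennessee</text>
</page>"""


def merge_wrapped_lines(xml_string, max_gap=1.5):
    """Join a <text> element onto the previous one when it looks like a wrap.

    Heuristic (an assumption, not pdftohtml behaviour): same left edge,
    same height, and a vertical step of at most max_gap * height.
    The font attribute is deliberately ignored.
    """
    merged = []
    previous = None
    for element in ET.fromstring(xml_string).iter("text"):
        current = {
            "top": int(element.get("top")),
            "left": int(element.get("left")),
            "height": int(element.get("height")),
            "text": "".join(element.itertext()).strip(),
        }
        is_wrap = (
            previous is not None
            and current["left"] == previous["left"]
            and current["height"] == previous["height"]
            and 0 < current["top"] - previous["top"] <= max_gap * previous["height"]
        )
        if is_wrap:
            merged[-1] += " " + current["text"]
        else:
            merged.append(current["text"])
        previous = current
    return merged


for line in merge_wrapped_lines(SAMPLE):
    print(line)
```

On this sample the heading stays on its own (its height is 22, not 16) and the two job-title fragments are joined. Consecutive body-text lines of equal height would be joined too, which may or may not be what is wanted.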

There is some kind of formatting quirk which might be from LinkedIn’s own conversion from their database to a printable PDF.

For example, on my actual LinkedIn profile, the entry looks (approximately) like this:

Graduate Research Assistant, Data Observation Network for Earth (DataONE)

University of Tennessee

August 2012 – Present (1 year 9 months) Center for Information and Communication Studies

Graduate Research Assistant for the Usability and Assessment working group, which conducts research to measure both the current data practices and opinions of DataONE stakeholders and the usability of DataONE for these stakeholders. Stakeholders include scientists, data managers, librarians, and educators.

Note that the job title is separate from the location. By exporting as a PDF, LinkedIn is adding in the “at” rather than following consistent formatting. It also looks like the PDF version omitted “Center for Information and Communication Studies.” It’s possible this won’t be a problem, but I’m not sure.

This might work for time period:

>August 2012 – Present </span>

>August 2012 – Present </span><span>
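A regular expression is one way to pull the time period out of fragments like these. This is a rough sketch that assumes the dates always look like “Month YYYY”, separated by an en dash or a hyphen, and end in either another date or the word “Present”. The function names are made up for illustration.

```python
import re

# Month name, four-digit year, an en dash (\u2013) or hyphen, then either
# "Present" or another month and year.
DATE_RANGE = re.compile(
    r"(?P<start>[A-Z][a-z]+ \d{4})\s*[\u2013-]\s*(?P<end>Present|[A-Z][a-z]+ \d{4})"
)


def find_date_ranges(text):
    """Return (start, end) pairs for every date range found in text."""
    return [(m.group("start"), m.group("end")) for m in DATE_RANGE.finditer(text)]


def is_current(date_range):
    """Treat a position as current when its end date is 'Present'."""
    return date_range[1] == "Present"


fragment = ">August 2012 \u2013 Present </span>"
for date_range in find_date_ranges(fragment):
    print(date_range, is_current(date_range))
```

Checking whether the end of the range is “Present” could also be a way to tell current positions from past ones without relying on fonts.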

But there is no difference between past and present jobs. For example:

Current job is this:

font="158">Graduate Research Assistant,

Past job is this:

font="250">Internet Services Specialist

So why do these have different fonts?

This might pose a problem for using pdftohtml, although it appears that the main headings are consistent with font="3" throughout.  I need to see if that holds true for the other PDFs converted to an XML format.
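Checking whether the heading font holds across the other converted files could be done with a small script like the one below. It is a sketch: the heading names come from my field list above, the function name is hypothetical, and it assumes each XML file can be read in as a string and parsed as-is.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Section names taken from the list of fields tracked above.
KNOWN_HEADINGS = {
    "Summary", "Specialties", "Experience", "Skills & Expertise",
    "Education", "Languages", "Publications", "Honors and Awards",
    "Interests", "Volunteer Experience",
}


def heading_fonts(xml_strings):
    """Map each known heading to the set of font ids it appears with."""
    fonts = defaultdict(set)
    for xml_string in xml_strings:
        for element in ET.fromstring(xml_string).iter("text"):
            label = "".join(element.itertext()).strip()
            if label in KNOWN_HEADINGS:
                fonts[label].add(element.get("font"))
    return dict(fonts)


# Example call (not run here), reading every XML file in the folder:
# from pathlib import Path
# folder = Path("linkedin profiles-WG-Members")
# report = heading_fonts(p.read_text() for p in folder.glob("*.xml"))
# A heading mapped to more than one font id means font="3" is not reliable.
```

If any heading comes back with more than one font id, matching on the heading text itself is probably the safer route.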

There may also be some other options.

Exploring “complex” output now by replacing -xml with -c.

It said “Failed to launch ghostscript” so I searched and found this: http://sourceforge.net/projects/ghostscript/files/GPL%20Ghostscript/8.71/ghostscript-8.71-macosx.tar.gz/download – although the most recent version appears to be ghostpdl-9.14.tar.gz

Not exactly sure what it is but here is some info: http://ghostscript.com/FAQ.html

I moved here to “call” ghostscript.

tjessel:ghostscript-8.71-macosx apple$ ./gs-8.71-macosx

I need to point to this from somewhere. In the PATH somewhere…
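A quick way to see whether a program is visible on the PATH at all is Python’s shutil.which. This is just a sketch; “gs” is the usual name of the Ghostscript executable on Unix-like systems, but treat that as an assumption for this particular download.

```python
import shutil


def missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


# "gs" is assumed to be the Ghostscript executable name; pdftohtml is the
# converter installed earlier. An empty list means both were found.
print(missing_tools(["pdftohtml", "gs"]))
```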

Some help file:

http://sourceforge.net/p/pdftohtml/discussion/150221/thread/03aebf38/

Another help file:

http://ghostscript.com/doc/current/Install.htm#Install_Unix

Will have to investigate further.

It is a lot easier to do:

brew install ghostscript

This seems to have worked.  For all that trouble did it do anything? Yes.

tjessel:linkedin profiles-WG-Members apple$ pdftohtml -c -i -noframes TannerJessel.pdf TannerJessel.html

The “complex document” -c option gave me richer HTML formatting.

However I am still hitting a dead end at the job titles – for example

past job –

class="ft250">Internet Services Specialist

and current job –

class="ft176">Graduate Research Assistant

So that is a bit frustrating.  I like that this is open source and command line, but I plan to continue looking at some of the other options that might preserve headings before I move on to using Beautiful Soup to parse the HTML file.
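In the meantime, grouping the text by its “ft” class is a cheap way to see which classes carry job titles. The sketch below uses the standard library’s html.parser rather than Beautiful Soup, only so that it runs without installing anything. The <span> tags in the sample are an assumption; only the class names and text come from the output above.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect text under the class of the nearest enclosing tag that has one."""

    def __init__(self):
        super().__init__()
        self.stack = []      # (tag, class attribute or None) for each open tag
        self.by_class = {}   # class name -> list of text fragments

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags like <br>.
        for index in range(len(self.stack) - 1, -1, -1):
            if self.stack[index][0] == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        for _, css_class in reversed(self.stack):
            if css_class:
                self.by_class.setdefault(css_class, []).append(text)
                break


# The <span> wrapper is assumed; class names and text are from the post.
SAMPLE = (
    '<span class="ft250">Internet Services Specialist</span>'
    '<span class="ft176">Graduate Research Assistant</span>'
)

collector = ClassTextCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.by_class)
```

Run over several converted profiles, a table like this would show quickly whether any class is used consistently for job titles.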

There is also a paper that describes “preprocessing” for extracting information from scientific publications:

Preprocessing is first done to convert the document from PDF to plain text and HTML formats, using the PDF995 utility suite. The plain text form is first processed to delimit sentences, then passed to a modern maximum entropy based part-of-speech (POS) tagger [11].

From:

Nguyen, T. D., & Kan, M. Y. (2007). Keyphrase extraction in scientific publications. In Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers (pp. 317-326). Springer Berlin Heidelberg, doi:10.1007/978-3-540-77094-7_41.

The PDF995 suite works on Windows:

Pdf995 makes it easy and affordable to create professional-quality documents in the popular PDF file format. Its easy-to-use interface helps you to create PDF files by simply selecting the “print” command from any application, creating documents which can be viewed on any computer with a PDF viewer. Pdf995 supports network file saving, shared printing, Citrix/Terminal Server, custom page sizes and large format printing. Pdf995 is a printer driver that works with any Postscript to PDF converter. The pdf995 printer driver and a free Converter are available for easy download.

 I’ll have to see if that has a different result.

 

About Tanner Jessel

I am a graduate research assistant funded by DataONE and pursuing a Masters in Information Sciences with an Interdisciplinary Graduate Minor in Computational Science. I assist scholarly research efforts supporting the Sociocultural, Usability and Assessment, and Member Nodes working groups within DataONE. I am based at the Center for Information and Communication Studies at the University of Tennessee School of Information Science in Knoxville, Tennessee.
