{"id":2015,"date":"2014-04-10T05:14:40","date_gmt":"2014-04-10T05:14:40","guid":{"rendered":"https:\/\/notebooks.dataone.org\/?p=2015"},"modified":"2014-04-10T18:47:49","modified_gmt":"2014-04-10T18:47:49","slug":"pdf-to-html-conversion","status":"publish","type":"post","link":"https:\/\/notebooks.dataone.org\/data-science\/pdf-to-html-conversion\/","title":{"rendered":"Text Processing Methods for Data Extraction (PDF to HTML conversion)"},"content":{"rendered":"

Tried on a Mac:<\/p>\n

VeryPDF “PDF to Any Converter”: did not like it. PDF to HTML was not good. PDF to Excel was OK, but one complaint is that the converted documents are placed into a new folder.<\/p>\n

Might be useful: http:\/\/sourceforge.net\/projects\/pdftohtml\/.<\/p>\n

After the first freeware\/trialware program I tried, I’m definitely dog-earing the open source option.<\/p>\n

pdftohtml is a text-processing application, written in C++ (?), that converts PDF to HTML or XML. It is kind of frustrating that there is no documentation, so I poked around for some other options. There is some help here: http:\/\/sourceforge.net\/p\/pdftohtml\/discussion\/150221<\/a><\/p>\n

I looked for “compiling on a mac” and found this:<\/p>\n

UPDATE:\u00a0 brew install pdftohtml<\/p>\n

Brew is “Homebrew<\/a>”, and it makes things pretty easy.
\n
\nLast login: Wed Apr\u00a0 9 16:07:20 on console
\ntjessel:~ apple$ brew install pdftohtml
\n==> Downloading https:\/\/downloads.sourceforge.net\/project\/pdftohtml\/Experimental
\n######################################################################## 100.0%
\n==> make
\n\/usr\/local\/Cellar\/pdftohtml\/0.40a: 4 files, 900K, built in 28 seconds
\ntjessel:~ apple$
\n<\/code><\/p>\n
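A quick way to confirm that the install put the tool on the PATH, sketched as a small shell function. The name `have_cmd` is made up for illustration; `command -v` is the standard POSIX lookup.

```shell
# have_cmd is a hypothetical helper name: it succeeds if the named
# program can be found on the PATH, and fails otherwise.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd pdftohtml; then
    echo "pdftohtml found at: $(command -v pdftohtml)"
else
    echo "pdftohtml not found; try: brew install pdftohtml"
fi
```

If the second message shows up right after a successful install, opening a new terminal window is a reasonable first thing to try.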

Now… how to run it (Homebrew already did the compiling).<\/p>\n

Apparently you just type “pdftohtml” in the terminal.<\/p>\n


\ntjessel:~ apple$ pdftohtml
\npdftohtml version 0.40 http:\/\/pdftohtml.sourceforge.net\/, based on Xpdf version 3.01
\nCopyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
\nCopyright 1996-2005 Glyph & Cog, LLC
\nUsage: pdftohtml [options] []
\n-f : first page to convert
\n-l : last page to convert
\n-q : don't print any messages or errors
\n-h : print usage information
\n-help : print usage information
\n-p : exchange .pdf links by .html
\n-c : generate complex document
\n-i : ignore images
\n-noframes : generate no frames
\n-stdout : use standard output
\n-zoom : zoom the pdf document (default 1.5)
\n-xml : output for XML post-processing
\n-hidden : output hidden text
\n-enc : output text encoding name
\n-dev : output device name for Ghostscript (png16m, jpeg etc)
\n-v : print copyright and version info
\n-opw : owner password (for encrypted files)
\n-upw : user password (for encrypted files)
\ntjessel:~ apple$
\n<\/code><\/p>\n
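Going by the usage text above, the options go before the PDF name. Here is a minimal sketch of one combination, assuming `-f` and `-l` each take a page number (the argument placeholders seem to have been dropped from the pasted usage text). `PDFTOHTML` and `first_pages_xml` are names invented for this example, not part of the tool.

```shell
# PDFTOHTML is a hypothetical override so the command can be swapped
# out for a dry run; it defaults to the real binary.
PDFTOHTML="${PDFTOHTML:-pdftohtml}"

# first_pages_xml (made-up name): convert pages 1..N of a PDF using
# the -xml output mode. Assumes -f and -l take a page number.
first_pages_xml() {
    pdf="$1"
    last="${2:-1}"   # default to the first page only
    "$PDFTOHTML" -xml -f 1 -l "$last" "$pdf"
}
```

Something like `first_pages_xml TannerJessel.pdf 2` would then ask for the first two pages as XML.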

So I think I need to navigate to the folder the PDFs are in?
\nCurrently the PDFs I am interested in are on an external flash drive. I’m moving them to “Documents” on my HDD.<\/p>\n

So, let’s see. I have this folder: linkedin profiles-WG-Members<\/p>\n

Tried this, doesn’t work.
\n
\ntjessel:linkedin profiles-WG-Members apple$ pdftohtml *.pdf *.html<\/code><\/p>\n

This apparently works:
\n
\ntjessel:linkedin profiles-WG-Members apple$ pdftohtml TannerJessel.pdf TannerJessel.html<\/code><\/p>\n
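The wildcard attempt most likely fails because the shell expands `*.pdf` into a list of file names before pdftohtml ever runs, while the tool appears to expect one PDF (plus an optional output name) per call. A loop that calls it once per file is one way around that. This is a sketch: `PDFTOHTML`, `html_name`, and `convert_all` are names invented here.

```shell
# PDFTOHTML is a hypothetical override for dry runs; it defaults to
# the real binary.
PDFTOHTML="${PDFTOHTML:-pdftohtml}"

# html_name: strip a trailing .pdf and add .html.
html_name() {
    printf '%s\n' "${1%.pdf}.html"
}

# convert_all: run the converter once per PDF in the current folder.
convert_all() {
    for pdf in *.pdf; do
        # With no matches the pattern stays literal, so skip it.
        [ -e "$pdf" ] || continue
        # Quotes keep file names with spaces in one piece.
        "$PDFTOHTML" "$pdf" "$(html_name "$pdf")"
    done
}
```

After `cd` into the folder of PDFs, calling `convert_all` would convert each one in turn.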

This generated three new files in the same directory as “TannerJessel.pdf.”<\/p>\n

1) TannerJessel_ind.html (this is an index)<\/p>\n

2) TannerJessel.html (this is a frameset that combines TannerJessels.html and TannerJessel_ind.html into one page)<\/p>\n

3) TannerJessels.html (just the converted content; I wonder if the “s” is for “single”?)<\/p>\n
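The three names follow a pattern, so a small helper can list what to expect for any PDF and report what is actually there. The pattern is inferred only from the single run above, not from documentation, and the function names are invented for this sketch.

```shell
# expected_outputs (made-up name): print the three file names that the
# run above produced, derived from the PDF's name.
expected_outputs() {
    base="${1%.pdf}"
    printf '%s\n' "${base}_ind.html" "${base}.html" "${base}s.html"
}

# check_outputs (made-up name): report which expected files exist in
# the current folder.
check_outputs() {
    expected_outputs "$1" | while IFS= read -r f; do
        if [ -e "$f" ]; then
            echo "ok      $f"
        else
            echo "missing $f"
        fi
    done
}
```

The usage text also lists `-noframes`, which presumably skips the frameset and index and leaves a single HTML file, though that is untested here.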

I was going to paste the output of TannerJessels.html below to inspect it. On second thought, I’m not: posting HTML on WordPress with the <code> tag is more trouble than it’s worth, so I am posting a .txt file instead, TannerJessel<\/a>.txt. (Note: cheat sheet<\/a> for code in WordPress here.) I’m unhappy because I was hoping the headings would be preserved.<\/p>\n

I opened it up and am not very excited about the results.<\/p>\n

From the PDF version of my LinkedIn profile, which fields am I interested in? Let’s open it up: TannerJessel<\/a>.pdf.<\/p>\n

Format seems to be this:<\/p>\n