My manager had a series of PDF files that were generated from some reporting tool. Each contained a graph and some text. He wanted to get the text out and put it into Word so it could be further manipulated. Easy enough, right? Open the file in Reader, select the text with your mouse, Control-C, go to Word, Control-V. Done.
However, as we in the world of IT know, it’s often not that simple.
He was unable to select the text, so the plan fell at step 1.
Other tricks, such as opening the file in LibreOffice or MS Office, also failed to produce text that could be extracted.
No problem, I figured that I’d just take the files home and use the other tools at my disposal on my Linux box.
Opening the file in Evince Document Viewer allowed me to select the text, but pasting it anywhere produced gibberish. I had thought that the text was actually an image, but since I could select it, I wasn’t sure.
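I tried extracting images with PDF Toolkit (pdftk) and its unpack_files command, but that also failed, which is not surprising, since unpack_files extracts file attachments rather than page images.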
I now suspected that the text was encoded in some strange character set, or that some other trick had been used on purpose to make it hard to get the text out. Why is anyone’s guess, since the text was user-supplied comments, not some magic proprietary analytic data.
Anyway, after a little head scratching and some internet searching, I came up with a rather simple alternative: convert the PDF to an image, and OCR it.
This turned out to be simple to do, at least on my home Linux box. I didn’t try on my work Windows machine, but as the tools required are available for Windows, it should work there too.
First, I removed the page with the graph, to avoid confusing the OCR tool. This probably wasn’t necessary, but I figured that it wouldn’t hurt. As the graph was on the first page on its own, I simply removed it:
pdftk Clarity.pdf cat 2-end output Clarity-1.pdf
Next, using ImageMagick, I converted the new file to a high-resolution image:
convert -density 300 Clarity-1.pdf -depth 8 -strip -background white -alpha off Clarity.tiff
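- -density 300 sets the resolution at which the PDF is rendered (300 dpi), and -depth 8 sets the bit depth to 8 bits per channel. OCR works best with high-resolution images; if you leave -density out, ImageMagick falls back to a much lower default resolution and you’re likely to get garbled results.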
- -strip drops embedded profiles and comments, while -background white -alpha off removes any alpha channel and makes the background white.
Then, I used Google’s Tesseract OCR engine to extract the text. Tesseract appends .txt to the output name itself, so only the base name is given:
tesseract Clarity.tiff Clarity
and voilà, there it was: Clarity.txt, a simple text file with all of the required text.
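For more than one file, the three steps above can be wrapped in a small shell function. This is only a sketch of how I might do it: pdf_to_text is a made-up name, the intermediate files go into a temporary directory of my choosing, and it assumes every PDF has the graph alone on page 1. On newer ImageMagick installs the convert command may be called magick instead.

```shell
# pdf_to_text is a hypothetical helper name; it assumes pdftk, ImageMagick's
# convert, and tesseract are on the PATH, and that the graph is alone on page 1.
pdf_to_text() {
    local pdf base workdir status
    pdf="$1"
    base="${pdf%.pdf}"                  # Clarity.pdf -> Clarity
    workdir="$(mktemp -d)" || return 1  # scratch space for intermediate files

    # Drop page 1, render the rest at 300 dpi, then OCR the image.
    # tesseract appends .txt to the output base, giving e.g. Clarity.txt.
    pdftk "$pdf" cat 2-end output "$workdir/pages.pdf" &&
        convert -density 300 "$workdir/pages.pdf" -depth 8 -strip \
            -background white -alpha off "$workdir/pages.tiff" &&
        tesseract "$workdir/pages.tiff" "$base"
    status=$?

    rm -rf "$workdir"
    return "$status"
}
```

A whole directory can then be processed with something like: for f in *.pdf; do pdf_to_text "$f"; done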
The OCR was almost 100% accurate. When reviewing the output, I did notice one small error, where Tesseract gave 11 (two digits) instead of ll (two lowercase letters), but that was all. Certainly, my manager did not complain about the results. In fact, he presented me with another batch of files to process. So it seems I’ve inherited another task. D’oh!
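Since the one error I found was a digit 1 standing in for a lowercase l, a quick grep can point out candidates for manual review in the next batch. This is a rough heuristic of my own, not a feature of Tesseract: flag_suspect_ones is a made-up name, and the pattern will also flag legitimate tokens such as "1st" or "A1".

```shell
# Print (with line numbers) every line where a digit 1 touches a letter,
# e.g. "wi11" or "a11ow". flag_suspect_ones is a hypothetical helper and
# the pattern is only a heuristic; review the matches by hand.
flag_suspect_ones() {
    grep -nE '[[:alpha:]]1|1[[:alpha:]]' "$1"
}
```

Running flag_suspect_ones Clarity.txt lists the lines worth a second look; no output means nothing matched.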