My Dominant Hemisphere

The Official Weblog of 'The Basilic Insula'

How To [Windows/Linux]: OCR On PDFs Using Tesseract and Imagemagick


[Image: OCR, via OCReactive@Flickr (CC BY-NC-SA License)]

Howdy readers!

Many moons ago, we met and talked about some of the basics of computer programming. Today I’m going to share with you a BASH shell script that I created using publicly available content as I was trying to OCR a couple of PDFs lying on my hard drive.

OCR is short for “Optical Character Recognition”. OCR software contains algorithms that analyze photographs/scanned images of books, articles, etc. (i.e. text matter) and convert them into plain text such that it can be copy/pasted or manipulated in various forms. For more on what OCR does, see here.

PDFs are ubiquitous these days. And although the file format has been open-sourced and standardized, the way people create PDFs has not. This gives rise to a plethora of unexpected differences: two people could create a PDF file from the same input and yet come out with totally different-looking PDFs. A lot of this has to do with differences in the way the metadata, layout information, text layer, embedded fonts, reflow properties, etc. have been stored in the PDF file. For across-the-board accessibility (by people using mobile phones, eReaders, etc.) getting all of these right is absolutely essential.

Sadly, many PDFs of eBooks available online (such as at Archive.org) lack these properties and thus can be a pain to read on small screens. One of the most frequent problems is that these PDFs are often merely a collection of scanned images of books and articles, and so aren’t amenable to note taking, highlighting text, or copy/pasting text. This is where OCR comes into play. Using OCR software one ends up with a file containing text that can then be manipulated to one’s liking. OCR software will obviously omit any pictures or illustrations in its output.

This how-to has been tested on Windows Vista Basic and uses free and open-source software. The script will also work on a Linux system.
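
On either platform, the script further down leans on three command-line tools: convert (from Imagemagick), gs (from Ghostscript) and tesseract. Once everything is installed, a quick check like the one below can tell you whether they are all reachable. This is only a sketch; the function name missing_tools is made up for illustration and is not part of any of these packages.

```shell
#!/bin/bash

# Print the name of every command given as an argument that cannot be
# found on the PATH. Prints nothing if all of them are available.
# (missing_tools is a hypothetical helper name, used only for this sketch.)
missing_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# convert comes from Imagemagick, gs from Ghostscript.
missing=$(missing_tools convert gs tesseract)

if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line
    echo "Not found on this system:" $missing
else
    echo "All three tools are available."
fi
```

If anything shows up as missing, go back to the package installation steps before running the OCR script.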

  1. Download and install Cygwin from here. Cygwin provides a Linux-like environment on the Windows platform. The default shell that it comes with is BASH. As compared to DOS on Windows, BASH provides a saner way to create tiny programs that can automate tasks. The commands are easier to read and understand.
  2. Run Cygwin and check the output of:
    echo $TERM

    If it says "dumb", then you’re faced with a well-known bug in the installation that doesn’t allow Cygwin to behave properly. To remedy this:

    1. Exit Cygwin.
    2. Click on the Start Menu.
    3. In the field that says “Start Search”, type “Run” and then hit ENTER.
    4. Type sysdm.cpl in the dialogue box that opens.
    5. You are now in the System Properties window. Click on the tab that says “Advanced”. Then click on “Environment Variables”. Under “System Variables” scroll down to and click on the entry that says “TERM” and click on the “Edit” button at the bottom.
    6. In the box that opens, delete whatever is under “Variable Value” and type cygwin.
    7. Click OK and close the box. Then click OK and close the “System Properties” box.
    8. Open Cygwin again and see that the output of echo $TERM gives you cygwin as the answer.
  3. We’ll need to install a few packages on Cygwin. Install the nano package. Nano is an easy-to-use text editor and is more reliable than lame-old Notepad. Notepad can sometimes misbehave and enter invisible control characters (such as carriage returns or end-of-file markers) that Linux systems WILL NOT ignore.
  4. Install the tesseract-ocr, tesseract-ocr-eng, imagemagick and ghostscript packages. Tesseract is the OCR software we shall be using. It works best with English text and supposedly has a reputation for being more accurate than other open-source tools out there. Imagemagick is a set of software tools that allow image manipulation using the command-line. Ghostscript is software that Imagemagick will require in order to work with PDFs.
  5. Open Cygwin. Right-click on the title bar of the window and go to Properties. Check (tick-mark) the boxes that say “QuickEdit Mode” and “Insert Mode”. Hit OK. Ignore any error messages that pop up.
  6. Using nano we will create a BASH script called ocr.sh (start the editor by typing nano ocr.sh). The script will later need to be placed in or copied to the directory that contains the PDF file that needs to be OCR’d. Type the following text out manually (exactly as it is) or just copy-paste it into nano. After copying text from here, when you right-click inside Cygwin, the text will be pasted inside the window. To save the file hit Ctrl-O. Then hit ENTER. Then exit nano by hitting Ctrl-X.

    [Screenshots: using nano to create a file on Cygwin; inside nano]

    #!/bin/bash
    
    # Created by Firas MR.
    # Website: http://mydominanthemisphere.wordpress.com
    
    # define variables
    SCRIPT_NAME=`basename "$0" .sh`
    TMP_DIR=${SCRIPT_NAME}-tmp
    OUTPUT_FILE=${SCRIPT_NAME}-output.txt
    
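    # make a temporary directory (stop if that fails, e.g. because it
    # already exists)
    
    mkdir "$TMP_DIR" || exit 1
    
    # copy PDF to temporary directory
    
    cp "$@" "$TMP_DIR" || exit 1
    
    # change current working directory to temporary directory. Stop if
    # that fails, otherwise the "rm *" further down would delete files
    # in the wrong directory.
    
    cd "$TMP_DIR" || exit 1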
    
    # use Imagemagick tool to read PDF pages at a pixel density of
    # 150 ppi in greyscale mode and output TIFF files at a pixel
    # depth of 8. Tesseract will misbehave with pixel depth > 8
    # or with color images.
    
    convert -density 150 -depth 8 -colorspace gray -verbose * p%02d.tif
    
    # For every TIFF file listed in numerical order in the temporary
    # directory (contd)
    
    for i in `ls *.tif | sort -tp -k2n`;
    
    do
    
    # strip away full path to file and file extension
    
     BASE=`basename "$i" .tif`;
    
    # run Tesseract using the English language on each TIFF file
    
     tesseract "${BASE}.tif" "${BASE}" -l eng;
    
    # append output of each resulting TXT file into an output file with
    # pagebreak marks at the end of each page
    
     cat ${BASE}.txt | tee -a $OUTPUT_FILE;
     echo "[pagebreak]" | tee -a $OUTPUT_FILE;
    
    # remove all TIFF and TXT files
    
     rm ${BASE}.*;
    
    done
    
    # move output file to parent directory
    
    mv $OUTPUT_FILE ..
    
    # remove any remaining files (eg. PDF, etc.)
    
    rm *
    
    # change to parent directory
    
    cd ..
    
    # remove temporary directory
    
    rmdir $TMP_DIR
    
  7. Next we’ll need to make the file executable by all users. To do this type
    chmod a+x ocr.sh

    and hit ENTER.

  8. Change directories to where the PDF file is located. For example, to change to the C: drive in Cygwin, type:
    cd /cygdrive/c/

    List contents by typing

    ls -al

    Copy ocr.sh to the directory that contains your PDF. Do this by typing

    cp ~/ocr.sh .

    (That dot is not a typo!). Rename the PDF to a simple name without spaces, hyphens or weird characters. Make it something like bookforocr.pdf . You can do this by typing

    mv <name of PDF file> bookforocr.pdf
  9. Type ./ocr.sh bookforocr.pdf and observe as your computer chugs away :-) ! You’ll end up with a file called ocr-output.txt containing the OCR’d data from the book! Imagemagick will use up quite a bit of RAM as it works on the PDF. Expect some sluggishness in your computer as it does this.
  10. You can convert the txt file into anything you like. For example, Calibre can turn it into an EPUB file that can then be uploaded to an eReader such as the B&N NOOK :-).
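
If you’d rather have one text file per page before handing things over to another tool, the [pagebreak] marks that ocr.sh writes make that fairly easy. Here is a rough sketch. The function name split_pages and the page-001.txt naming scheme are made up for illustration, and the form-feed stripping rests on the assumption that some Tesseract versions append a form-feed character to each page of output.

```shell
#!/bin/bash

# Split the output of ocr.sh into one file per page.
# $1: text file containing [pagebreak] marks, $2: directory for the pages
# (split_pages is a hypothetical helper, not part of the original script.)
split_pages() {
    local input=$1 outdir=$2
    mkdir -p "$outdir" || return 1
    awk -v dir="$outdir" '
        BEGIN { page = 1; out = sprintf("%s/page-%03d.txt", dir, page) }
        # drop form-feed characters (assumed to come from Tesseract)
        { gsub(/\f/, "") }
        # a pagebreak mark closes the current page and starts the next one
        $0 == "[pagebreak]" {
            close(out)
            page++
            out = sprintf("%s/page-%03d.txt", dir, page)
            next
        }
        { print > out }
    ' "$input"
}

# Example: split_pages ocr-output.txt pages
```

A page on which nothing was recognized produces no file at all, so gaps in the numbering simply mean blank pages.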

One could modify the script to crop, set white-points, etc. for anything fancier. For Windows users who like a GUI, a good open-source cropping tool for PDFs is BRISS. It is a great boon for easily cropping multi-column text matter. Another great tool for the same purpose is Papercrop (although, since it rasterizes its output you notice a significant decrease in quality).
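
As a rough idea of what such a modification might look like, the convert line in ocr.sh could be swapped for something along these lines. Treat it as a sketch under assumptions: the function name pdf_to_tiffs is made up, and the crop geometry and white-point percentage in the example are invented numbers that you would have to tune for your own scan. It uses Imagemagick’s -crop and -level options, with +repage to reset the page offsets after cropping.

```shell
#!/bin/bash

# Hypothetical replacement for the convert line in ocr.sh.
# $1: crop geometry as WIDTHxHEIGHT+X+Y, in pixels at 150 ppi
# $2: white point as a percentage; anything brighter becomes pure white
# remaining arguments: the input file(s)
pdf_to_tiffs() {
    local geometry=$1 white_point=$2
    shift 2
    convert -density 150 "$@" \
        -crop "$geometry" +repage \
        -level "0%,${white_point}%" \
        -depth 8 -colorspace gray -verbose p%02d.tif
}

# Example (made-up values):
# pdf_to_tiffs 1000x1500+120+100 85 bookforocr.pdf
```

Cropping away margins and page furniture before OCR also means less stray junk ends up in the text output.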

A Linux Journal article describes how to find out position co-ordinates for cropping using GIMP.
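
Imagemagick wants a crop region written as WIDTHxHEIGHT+X+Y. Assuming you have read off the pixel coordinates of the top-left and bottom-right corners of the region you want to keep, a tiny helper can do the arithmetic. The function name crop_geometry is made up for this sketch, and the coordinates must be measured on an image rendered at the same density (150 ppi here) that the script uses.

```shell
#!/bin/bash

# Turn two corner points into an Imagemagick geometry string.
# $1, $2: x and y of the top-left corner
# $3, $4: x and y of the bottom-right corner
# (crop_geometry is a hypothetical helper name.)
crop_geometry() {
    local x1=$1 y1=$2 x2=$3 y2=$4
    echo "$((x2 - x1))x$((y2 - y1))+${x1}+${y1}"
}

crop_geometry 120 100 1120 1600   # prints 1000x1500+120+100
```

The resulting string is what you would pass to convert’s -crop option.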

Another way that I discovered to OCR a PDF is to use OCRopus. It claims to have automatic and intelligent layout analysis for dealing with stuff like multiple columns, etc.

Alrighty then. See you next time! Feel the OCR power on your PDFs :-) !

# Footnotes:

Ubuntuforums Howto on OCR
Circle.ch: How to OCR multipage PDF files
The Kizz Notes: cygwin: WARNING: terminal is not fully functional


Copyright Firas MR. All Rights Reserved.

“A mote of dust, suspended in a sunbeam.”


Written by Firas MR

March 20, 2011 at 8:58 am

Posted in Technology, Unix
