Nirav Mistry

Jul 12, 2020 · 5 min read

Optical Character Recognition (OCR) using (Py)Tesseract : Part 1

Python-tesseract is an optical character recognition (OCR) tool for Python. That is, it will recognize and “read” the text embedded in images.

Python-tesseract is a wrapper for Google’s Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including JPEG, PNG, GIF, BMP, TIFF, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file.

We're going to start experimenting with tesseract using just a simple image of nice clean text.

Let's first import Image from PIL and display the image ocr.png.

from PIL import Image

# Open the sample image; display() is provided by the notebook environment
image = Image.open("../input/ocr.png")
display(image)

Great, we have a base image of some big, clear text.

Let's import pytesseract and use the dir() function to get a sense of what might be some interesting functions to play with.

import pytesseract
 
dir(pytesseract)

['Output',
 'TSVNotSupported',
 'TesseractError',
 'TesseractNotFoundError',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'get_tesseract_version',
 'image_to_boxes',
 'image_to_data',
 'image_to_osd',
 'image_to_pdf_or_hocr',
 'image_to_string',
 'pytesseract',
 'run_and_get_output']
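Most of that listing is module plumbing (the double-underscore names). As a small sketch, a hypothetical helper called public_callables can filter a module's dir() output down to its public functions and classes. It is demonstrated on the standard-library json module so that it runs without Tesseract installed; the same call should work on pytesseract.

```python
import json

def public_callables(module):
    """Return the names in dir(module) that are public and callable."""
    return [
        name
        for name in dir(module)
        if not name.startswith("_") and callable(getattr(module, name))
    ]

# Demonstrated on json; public_callables(pytesseract) would be used the same way
print(public_callables(json))
```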

It looks like there are just a handful of interesting functions, and I think image_to_string is probably our best bet. Let's use the help() function to interrogate this a bit more.

help(pytesseract.image_to_string)

Help on function image_to_string in module pytesseract.pytesseract:

image_to_string(image, lang=None, config='', nice=0, output_type='string')
    Returns the result of a Tesseract OCR run on the provided image to a string.

Ok, let's try and run tesseract on this image.

text = pytesseract.image_to_string(image)
print(text)

See the magic of OCR using
pytessaract. we will be able to
read the content of image and
convert it to text.
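The signature shown by help() also accepts lang and config arguments. As a hedged sketch: config is passed through to the Tesseract command line, where --psm selects the page segmentation mode (6 treats the image as a single uniform block of text). The helper names tesseract_config and ocr_single_block are hypothetical, and whether a non-default mode helps will depend on the image.

```python
def tesseract_config(psm=6):
    """Build a config string selecting a Tesseract page segmentation mode."""
    return f"--psm {psm}"

def ocr_single_block(image, lang="eng", psm=6):
    """Run OCR on a PIL image, treating it as one uniform block of text."""
    # Imported lazily so tesseract_config() can be used without pytesseract
    import pytesseract
    return pytesseract.image_to_string(image, lang=lang, config=tesseract_config(psm))

print(tesseract_config())  # --psm 6
```

With Tesseract installed, ocr_single_block(image) could be called on the image opened earlier in place of the plain image_to_string call.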

In the previous example, we were using a clear, unambiguous image for conversion. Sometimes there will be noise in images you want to OCR, making it difficult to extract the text. Luckily, there are techniques we can use to increase the efficacy of OCR with pytesseract and Pillow.

Let's use a different image this time, with the same text as before but with added noise in the picture.

We can view this image using the following code.

from PIL import Image

img = Image.open("../input/OCR/ocr-Noisy.png")
display(img)

As you can see, this image has shapes of different opacities behind the text, which can confuse the tesseract engine. Let's see if OCR will work on this noisy image.

import pytesseract

text = pytesseract.image_to_string(Image.open("../input/OCR/ocr-Noisy.png"))
print(text)

See the magic of

pytessaract. we

read the content of image and
convert it to text.

This is a bit surprising given how nicely tesseract worked previously! Let's experiment on the image using techniques that will allow for more effective image analysis. First up, let's change the size of the image.

# First we will import Image from PIL
from PIL import Image
# Then set the base (target) width of our image
basewidth = 600
# Now let's open it
img = Image.open("../input/OCR/ocr-Noisy.png")
# We want to keep the correct aspect ratio, so we compute a scale factor by taking the
# base width and dividing it by the actual width of the image
wpercent = basewidth / float(img.size[0])
# With that ratio we can just get the appropriate height of the image.
hsize = int(float(img.size[1]) * wpercent)
# Finally, let's resize the image. LANCZOS is a high-quality resampling filter that tries to
# keep lines smooth (its older alias, ANTIALIAS, was removed in Pillow 10)
img = img.resize((basewidth, hsize), Image.LANCZOS)
# Now let's save this to a file
img.save('resized_nois.png')  # save the image as a PNG
# And finally, let's display it
display(img)
# and run OCR
text = pytesseract.image_to_string(Image.open('resized_nois.png'))
print(text)

See the magic of
pytessaract. we
read the content of image and
convert it to text.
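The aspect-ratio arithmetic used in the resize step can be isolated into a small pure function. This is only a sketch: the name scaled_size and the 1200 x 800 example dimensions are hypothetical and not taken from the image above.

```python
def scaled_size(width, height, base_width):
    """Return (new_width, new_height) that keeps the original aspect ratio."""
    # Scale factor: target width divided by the current width
    ratio = base_width / float(width)
    return base_width, int(height * ratio)

# A hypothetical 1200 x 800 image scaled down to a width of 600
print(scaled_size(1200, 800, 600))  # (600, 400)
```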

Hmm, resizing the image brought no improvement. Let's convert the image to greyscale. Images can be converted in many different ways. If we poke around in the Pillow documentation, we find that one of the easiest is to call the convert() function and pass in the string 'L'.

img = Image.open('../input/OCR/ocr-Noisy.png')
img = img.convert('L')
# Now let's save that image
img.save('greyscale_noise.jpg')
# And run OCR on the greyscale image
text = pytesseract.image_to_string(Image.open('greyscale_noise.jpg'))
display(img)
print(text)

magic of
Saract. we
read the content of image and
convert it to text.
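For context, Pillow's documentation describes the 'L' conversion of an RGB image as a weighted sum of the three channels (the ITU-R 601-2 luma transform). A rough sketch of that per-pixel arithmetic follows; the function name luma is hypothetical, and Pillow's own rounding may differ slightly from this integer version.

```python
def luma(r, g, b):
    """Approximate the grey level Pillow assigns to an RGB pixel in mode 'L'."""
    # Weights from the ITU-R 601-2 luma transform:
    # L = R * 299/1000 + G * 587/1000 + B * 114/1000
    return (r * 299 + g * 587 + b * 114) // 1000

print(luma(255, 255, 255))  # 255 (white stays white)
print(luma(0, 0, 0))        # 0 (black stays black)
print(luma(255, 0, 0))      # 76 (pure red becomes a dark grey)
```

Because green carries the largest weight, coloured shapes of similar brightness can end up as quite different greys, which is one reason greyscale conversion alone may not separate text from background.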

There is no significant improvement, so let's look at a few other techniques that can help improve OCR when the two above don't. The next approach I would use is called binarization, which means separating the image into two distinct parts: in this case, black and white. Binarization is carried out through a process called thresholding. If a pixel value is lower than a threshold value, it is converted to a black pixel; otherwise it is converted to a white pixel. This process can remove background noise before OCR, allowing greater recognition accuracy. With Pillow, this process is straightforward.

Let's open the noisy image and convert it using binarization.

img = Image.open('../input/OCR/ocr-Noisy.png').convert('1')
# Now let's save and display that image
img.save('black_white_noise.jpg')
display(img)

So, that was a bit magical, and it really required a careful reading of the docs to figure out that passing the string "1" to the convert function is what actually does the binarization. But you already have all of the skills you need to write this functionality yourself.

Let's walk through an example. First, let's define a function called binarize, which takes in an image and a threshold value:

def binarize(image_to_transform, threshold):
    # now, let's convert that image to a single greyscale image using convert()
    output_image = image_to_transform.convert("L")
    # the threshold value is usually provided as a number between 0 and 255, which
    # is the range of values a single 8-bit byte can hold.
    # the algorithm for the binarization is pretty simple: go through every pixel in the
    # image and, if it's at or above the threshold, turn it all the way up (255), and
    # if it's lower than the threshold, turn it all the way down (0).
    # so let's write this in code. First, we need to iterate over all of the pixels in the
    # image we want to work with
    for x in range(output_image.width):
        for y in range(output_image.height):
            # for the given pixel at x,y, let's check its value against the threshold
            if output_image.getpixel((x, y)) < threshold:  # note that the first parameter is actually a tuple object
                # let's set this to zero
                output_image.putpixel((x, y), 0)
            else:
                # otherwise let's set this to 255
                output_image.putpixel((x, y), 255)
    # now we just return the new image
    return output_image
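Calling getpixel() and putpixel() for every pixel is easy to follow but slow on large images. As an alternative sketch (not the approach used in the rest of this article), Pillow's Image.point() applies a function to each of the 256 possible pixel values and remaps the whole image in one call. The name binarize_fast is hypothetical; it is meant to apply the same rule as binarize for images Pillow can convert to 'L'.

```python
from PIL import Image

def binarize_fast(image_to_transform, threshold):
    """Threshold a PIL image: values below threshold become 0, the rest 255."""
    grey = image_to_transform.convert("L")
    # point() evaluates the function once per possible grey level, not per pixel
    return grey.point(lambda value: 0 if value < threshold else 255)

# Tiny in-memory test image with three grey levels
sample = Image.new("L", (3, 1))
for x, value in enumerate([10, 128, 200]):
    sample.putpixel((x, 0), value)

result = binarize_fast(sample, 128)
print([result.getpixel((x, 0)) for x in range(3)])  # [0, 255, 255]
```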

Let's test this function over a range of different thresholds. Remember that you can use the range() function to generate a sequence of numbers at different step sizes. range() is called with a start, a stop, and a step size. So let's try range(0, 257, 64), which should generate 5 images of different threshold values.
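As a quick check of which thresholds that call produces (range() stops before its stop value, which is why 257 rather than 256 is used):

```python
thresholds = list(range(0, 257, 64))
print(thresholds)  # [0, 64, 128, 192, 256]
```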

for thresh in range(0, 257, 64):
    print("Trying with threshold " + str(thresh))
    # Binarize once and reuse the result for both display and OCR
    binarized = binarize(Image.open('../input/OCR/ocr-Noisy.png'), thresh)
    # Let's display the binarized image inline
    display(binarized)
    # And let's use tesseract on it
    print(pytesseract.image_to_string(binarized))

Trying with threshold 0

Trying with threshold 64
See the magic of OCR using
pytessaract. we will be able to
read the content of image and
convert it to text.

Trying with threshold 128
magic of OCR using
saract. we will be able to
read the content of image and
convert it to text.

Trying with threshold 192
magic of
Saract. we
read the content of image and
convert it to text.

Trying with threshold 256
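To see why the runs at thresholds 0 and 256 return no text, the same rule can be applied to a few grey levels in plain Python. This is a sketch only: the helper threshold_row and the three sample values are hypothetical stand-ins for dark text, a mid-grey shape, and a light background.

```python
def threshold_row(values, threshold):
    """Apply the same rule as binarize() to a plain list of grey levels."""
    return [0 if value < threshold else 255 for value in values]

# Hypothetical grey levels: dark text, mid-grey shape, light background
row = [20, 120, 250]
for thresh in range(0, 257, 64):
    print(thresh, threshold_row(row, thresh))
```

At a threshold of 0 no value is below the threshold, so every pixel becomes 255; at 256 every value is below it, so every pixel becomes 0. Either way the text and background end up the same colour.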

We can see from this that a threshold of 0 essentially turns everything white and a threshold of 256 turns everything black, that the text becomes bolder as we move towards a higher threshold, and that the shapes, which have a filled-in grey color, become more evident at higher thresholds. In the next lecture, we'll look a bit more at some of the challenges you can expect when doing OCR on real data.
