Sunday 15 July 2018

Extracting text from an image using pytesseract

What is OCR?

Optical Character Recognition (OCR) is the process of electronically extracting text from images or documents such as PDFs, so that the text can be reused, for example in full-text search.

pytesseract:

pytesseract is a Python wrapper around Tesseract, an open source Optical Character Recognition (OCR) engine available under the Apache 2.0 license. Tesseract can be used directly from the command line or, for programmers, through an API to extract typed or printed text from images; handwritten text can be attempted too, though results are usually less reliable. It supports a wide variety of languages and can read common image formats such as PNG, JPEG, GIF, TIFF and BMP. It is widely used to process scanned documents.
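Since language support is one of Tesseract's main features, here is a minimal sketch of how a language can be selected from Python. It assumes pytesseract and Pillow are installed and that the matching Tesseract language data is present on the system. The helper name extract_text is made up for this post; lang is the keyword argument that pytesseract.image_to_string accepts.

```python
from pathlib import Path


def extract_text(image_path, lang="eng"):
    """Run OCR on one image file and return the recognized text."""
    path = Path(image_path)
    # Fail early with a clear message if the image is missing.
    if not path.is_file():
        raise FileNotFoundError(f"No such image: {path}")

    # Imported here so a missing file is reported before the OCR
    # dependencies are loaded.
    from PIL import Image
    import pytesseract

    with Image.open(path) as im:
        # lang takes Tesseract language codes; several can be joined
        # with "+", for example "eng+deu".
        return pytesseract.image_to_string(im, lang=lang)
```

Calling extract_text('textimg.png', lang='eng') should behave like the basic script in this post, while a different language code only works if that language's data files are installed for Tesseract.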

Installation:

$ sudo pip install pytesseract
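Requirements:
pytesseract calls the Tesseract engine, so Tesseract itself must be installed on the system. The script below also needs Pillow (PIL) to open the image.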
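Because pytesseract depends on the Tesseract executable, it can help to check that the executable is reachable before running any OCR. The sketch below uses only the standard library. The helper name find_tesseract is made up for this post, and it assumes the executable is called tesseract, which is the usual name but may differ on some installs.

```python
import shutil


def find_tesseract(binary_name="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(binary_name)


if __name__ == "__main__":
    location = find_tesseract()
    if location is None:
        print("Tesseract was not found on PATH; install it before using pytesseract.")
    else:
        print(f"Tesseract found at {location}")
```

If the executable lives in a folder that is not on the PATH, pytesseract can be pointed at it by setting pytesseract.pytesseract.tesseract_cmd to the full path.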
In Python Script:
from PIL import Image
import pytesseract

# Open the image and run OCR on it
im = Image.open('textimg.png')
text = pytesseract.image_to_string(im)
print(text)
Save the image below on your system as textimg.png.
[Image: input image]
After running the Python file, the output looks like the image below.
[Image: imagetotext output.jpg]
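The string returned by OCR may contain blank lines, uneven spacing or a trailing form feed, which gets in the way when the text is reused for searching. Here is a small sketch of one way to tidy it up with plain Python. The function names and the sample string are made up for illustration and are not part of pytesseract.

```python
def clean_ocr_text(raw_text):
    """Collapse runs of whitespace and drop empty lines from OCR output."""
    cleaned_lines = []
    for line in raw_text.splitlines():
        # split() with no arguments also discards tabs and form feeds
        words = line.split()
        if words:
            cleaned_lines.append(" ".join(words))
    return "\n".join(cleaned_lines)


def contains_phrase(raw_text, phrase):
    """Case-insensitive search for a phrase in cleaned OCR output."""
    # Join lines with spaces so a phrase can match across a line break.
    haystack = " ".join(clean_ocr_text(raw_text).split()).casefold()
    needle = " ".join(phrase.split()).casefold()
    return needle in haystack


# Made-up sample standing in for real OCR output
sample = "Hello   World\n\n  this is  a test \n\x0c"
print(clean_ocr_text(sample))
print(contains_phrase(sample, "world this"))
```

In the script from this post, the text variable could be passed to clean_ocr_text before printing or searching it.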
