Extracting text from the image and translation using Tesseract and Yandex API

Presenting by Madhan Kumar Selvaraj

Text extraction from the image and translation

In this blog, we will learn about the process of extracting the text from the image by using the open-source Google's Optical Character Recognition OCR engine Tesseract.

Then we will use the multiple language detector and translator Yandex API to detect and translate the language of the text and the image.

What is OCR ?

OCR refers to Optical Character Recognition. As a human we can able to recognize the text from the image but what about the computer and how it can read or parse the data from the image.

At that time we found the OCR and inside the OCR engine, there will be computer vision for the image processing and machine learning technique to train the machine with training data set using certain algorithms to improve the accuracy.

What are the benefits of learning and using OCR?

The ultimate benefit of the OCR is to convert the hard copy into the digital format. Still in this digital world, we are using notebooks and paperback for entering the data. Data entry jobs can be done rapidly with reliability and with more accuracy.

What we'll use in this simple OCR and translation project?

Before moving to the coding part, we need a few prerequisites to work on it

Python

Have a quick tour in this python-tutorial and you will get to know about the installation process and basics of python
Download pycharm IDE community version for better coding

Pip installation

Pip wheel contains lots of packages and it'll be very useful
Install pip using this installation procedure
Reference - pip installation and setting environment variable

Setting virtual environment in Python

We are also going to use Pipenv since it also handles the virtual-environment setup and requirements management.
Reference - Setting python virtual environment

Python Imaging Library

Pillow library is a fork of the Python Imaging Library (PIL) to handle the opening and manipulation of images in many formats in Python.
Reference - pillow installation using pip

Python flask

The Flask web framework to create our simple OCR server where we can take pictures via the webcam or upload photos for character recognition purposes.
Go through the basic Flask tutorial

Tesseract OCR

Python-Tesseract, or simply PyTesseract, the library which is a wrapper for Google's Tesseract-OCR Engine and it is completely open-source and being developed and maintained by the giant that is Google. Follow these instructions to install Tesseract on your machine, since PyTesseract depends on it.

Let's construct our OCR

Now I am creating a simple OCR python program ocr.py to extract the text from the image file. image_to_string in the tesseract OCR helps to extract the text from the image.

pytesseract.image_to_string(Image.open(filename), lang='eng+tam')

In this example, OCR can capable of extracting the English as well as the Tamil language in the image. We can change the modify the language in the lang keyword

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

def ocr_core(filename):
    # Path of the tesseract engine
    pytesseract.pytesseract.tesseract_cmd = r'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'
    # Pillow's Image class to open the image and pytesseract to detect the string in the image
    text = pytesseract.image_to_string(Image.open(filename), lang='eng+tam')
    return text
# Path of the image file
print(ocr_core(r"E:\Coding\Python\OCR\Images\sample_image.PNG"))

Image file to extract the text

Open the terminal where the python file located and run the program by python ocr.py by following the below code and you will get the desired result

Yandex API for language detection and translation

Have a glance at the Yandex documentation to know about the working technique of the Yandex. We need an API key to call the Yandex and sign up for the website to get the API key.

Flask web interface

Learn the basic flask tutorial because our project is mainly integrated with it.

Complete OCR model with language detection and translation

Create a python program main.py under the folder name OCR. This program is doing a few important tasks as listed below

It is the main function where it callbacks the other vital functions

OCR function to extract the text
Language detection function to detect the language
Language translation function to translate the text and text from the OCR image

Checks the allowed image file extensions while uploading the file
Links with Python flask web framework

Comments in each line explain the purpose and the use of the code.

# Import 'flask' for web development
from flask import Flask, render_template, request

# importing 'ocr_core' function from the 'OCR' python file
from ocr import ocr_core
# importing 'language_detector' function from the 'api_detect' python file
from api_detect import language_detector
# importing 'translate' function from the 'api_translate' python file
from api_translate import translate

# define a folder to store and later serve the images
UPLOAD_FOLDER = "E:\\Coding\\Python\\OCR\\ocr_image_process"

# allow files of a specific type
ALLOWED_EXTENSIONS = set(['png', 'jpg', 'jpeg'])

app = Flask(__name__)
# app.debug = True


# function to check the file extension
def allowed_file(filename):
    return '.' in filename and \
           filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

# route and function to handle the upload page
@app.route('/', methods=['GET', 'POST'])
def upload_page():
    index = 0
    if request.method == 'POST':
        # check if there is a file in the request
        if ('file' in request.files) or ('file1' in request.files):
            # This condition will execute when user tries to extracts the text along with translation
            if request.files['file'].filename == "":
                index = 1
                file = request.files['file1']
            else:
                file = request.files['file']
            # Check language choose by the user to translate
            if request.form['languages']:
                translate_language = request.form['languages']

            if file and allowed_file(file.filename):

                # call the OCR function on it
                extracted_text = ocr_core(file)
                # Condition to both text extraction and translation
                if index == 1:
                    detected_language = language_detector(extracted_text)
                    translated_text = translate(extracted_text, detected_language, translate_language)
                    return render_template('upload.html',
                                           extracted_text=extracted_text,
                                           detected_language= detected_language,
                                           translated_text=translated_text)
                # Extracts the text from the image
                else:
                    return render_template('upload.html',
                                           extracted_text=extracted_text)
        # Condition for only translation from the text
        if 'translate' in request.form:
            # Check language choose by the user to translate
            if request.form['languages']:
                translate_language = request.form['languages']
            text = request.form['translate']
            # Calling 'language_detector' function to detect the language
            detected_language = language_detector(text)
            # Condition when the API fails to detect the language
            if detected_language == "":
                detected_language = "English"
            # Calling 'translate' function to translate the text
            translated_text = translate(text, detected_language, translate_language)
            return render_template('upload.html', translated_text=translated_text,detected_language=detected_language)
    # 'GET' request
    elif request.method == 'GET':
        return render_template('upload.html')


if __name__ == '__main__':
    app.run(debug=True)

Create api_translate.py python program to translate the image. In this script, we need to use the Yandex API key to call the Yandex as mentioned earlier. Use your Yandex API key in the api_key variable.

import requests
import json
def translate(parsed_text, detected_language, translate_language):
    # List of language code available to translate the text data
    language_dict = {'Tamil': 'ta', 'Hindi': 'hi', 'Telugu': 'te', 'Kannada': 'kn', 'English': 'en'}
    # Condition when both 'translate_language' and 'detected_language' are available
    if (translate_language in language_dict.keys()) and (detected_language in language_dict.keys()):
        translate_language_code = language_dict[translate_language]
        detected_language_code = language_dict[detected_language]
    # condition when our function can't dect the 'detected_language'
    elif (detected_language == "Can't detect the language") and (translate_language in language_dict.keys()):
        translate_language_code = language_dict[translate_language]
        detected_language_code = "en"
    else:
        # Default languages
        detected_language_code = "en"
        translate_language_code = "ta"
    api_key = "Use your Yandex API key" 
    url = api_key+"&text="+parsed_text+"&lang="+detected_language_code+"-"+translate_language_code+"&[format=plain]"
    # API call through python
    response = requests.get(url)
    # Converting to JSON format
    result = json.loads(response.text)
    # Condition to check the translated text
    if 'text' in result.keys():
        return result['text'][0]
    else:
        # When function fails to execute
        return "Can't translate"

Create an api_detect.py python program to translate the image. In this script, we need to use the Yandex API key to call the Yandex as mentioned earlier. Use your Yandex API key in the api_key variable.

import requests
import json
# Text received from Flask
def language_detector(text_data):
    api_key = "Use your Yandex API key"
    url = api_key + "&text=" + text_data
    # Python API call to Flask
    response = requests.get(url)
    # Converting data to JSON format
    result = json.loads(response.text)
    # MAtching language code with the user's language input
    language_dict = {'ta': 'Tamil', 'hi': 'Hindi', 'te': 'Telugu', 'kn': 'Kannada', 'en': 'English'}
    # Checking whether language is available
    if ('lang' in result.keys()) and (result['lang'] != ""):
        lang = result['lang']
        return language_dict[lang]
    else:
        return "Can't detect the language"

Create an ocr.py python program to extract text from the image using Google's Tesseract Optical Character Engine.

try:
    # Importing Pillow for image processing
    from PIL import Image
except ImportError:
    import Image
    # Importing Tesseract for OCR engine
import pytesseract

def ocr_core(filename):
    # Path of the tesseract engine
    pytesseract.pytesseract.tesseract_cmd = r'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'
    # Pillow's Image class to open the image and pytesseract to detect the string in the image
    # Extracts when both English and Tamil language present in the image
    text = pytesseract.image_to_string(Image.open(filename), lang='eng+tam')
    return text

Now we are going to create an upload.html Html file in the template folder under the OCR folder. Used to upload the image from the browser and to enter the text data in the text box and to select the language to translate. Save the file in the template folder.

Saving the Js and Jquery files folder under the OCR folder. These help to do click function in the web application.

Html, JS and Jquery files should be in the below directory path.

All the programming work completed. Now it's time to check and execute our program. Use workon keyword to create a new python project in the python virtual environment

Now run out main.py python project in the terminal

Open the link http://127.0.0.1:5000/ in the browser and you will get the below web page

Click the first radio button to extract the text from the image. Here our Js file works while clicking on the radio button. In this project, we used only the English and the Tamil language. We can add languages separately from the Tesseract.

I used the sample image that I used previously. Here's our web page result after processing OCR.

Sample Tamil language image file to extract the text data

OCR result that we got for this image

Now in the next radio button option, we can translate the text data to any language. Currently, I used only five languages if you want you can add more languages in the api_detect.py and api_translate.py python program. In addition to that, it can predict the language from which text language we try to translate

In this example, I selected the Hindi language to translate my text and here's the output

In the third option, we can do both operations simultaneously. It can extract the text data from the image and it translates the text according to the user's wish

Here the translated output contains a lot of error and this result we got from the Yandex API. We can improve accuracy by using any other translation API's

Code in GitHub

If you are too lazy like me to read all the code and procedure. Then here is the code for you.

Finally, you learned about

Implementing Computer Vision (CV) in this image processing can increase the accuracy while extracting the text from the image
Other than the image format, PDF's and multiple format data can be extracted by using Textract package
Language detection and translation can be improved by training more data set by using the Machine Learning (ML) algorithms

Reference

P.S - Most of the people in this generation are workaholic and they are not spending their quality time with their parents, spouse, children who they are thinking them as their world. You can earn money but not love. So spend your time with them.

Search This Blog

Madhan Kumar Selvaraj's blog