OCR with Deep Learning in PyTorch (Recognizing Irregular Text)

Part 2: Advanced Scene Text Recognition

Mohamed Hasan
5 min read · May 8, 2023

This is the second tutorial in the series explaining Optical Character Recognition (OCR) with deep learning. The first part introduced the topic with a Python code implementation. In this part, we will go through more advanced scene text recognition methods.

Introduction

Scene text recognition is essential in several computer vision tasks such as traffic sign reading, product recognition, intelligent inspection, and image searching. In spite of the success of Convolutional (CNN) and Recurrent (RNN) neural networks in advancing OCR, simple recognition methods do not address the challenges of irregular text inside images. The following figure compares regular (horizontal and frontal) versus irregular text:

Examples of regular and irregular scene text. (a) Regular text. (b) Slanted and perspective text. (c) Curved text. This is Figure 1 in the MORAN paper.

Irregular text appears widely in natural scenes. However, it is difficult to recognize due to the large variance in its shapes and distorted patterns. The following figure depicts more instances of irregular text, which are typical cases of oriented, perspective, and curved text:

Examples of irregular text (Figure 1 of ASTER paper).

Several techniques have been introduced to tackle the aforementioned challenges of irregular text recognition. The basic idea is to rectify the image before text recognition: rectification transforms the irregular text into an easier, regular version. In this tutorial, we will go through two such methods: the Attentional Scene Text Recognizer with Flexible Rectification (ASTER) and the Multi-Object Rectified Attention Network (MORAN).

ASTER

ASTER tackles the irregular text problem with an explicit rectification mechanism. As depicted in the next figure, the model comprises two parts: the rectification network and the recognition network.

Overview of ASTER model (dashed lines show the flow of gradients).

Given an input image, the rectification network transforms the image to rectify the text inside it. The resulting rectified image is passed to the recognition network that predicts the output character sequence.

More specifically, the rectification network is trained to predict the transformation parameters required to rectify the irregular text into a regular version. During inference, the rectification network predicts the transformation parameters and applies them to the image, as shown in the next figure. This network is based on the Thin Plate Spline (TPS) transformation and the Spatial Transformer Network (STN) framework.

TPS transformation of irregular text. The green '+' marks show the control points used to estimate the transformation parameters. For more details, check the ASTER paper.
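
To make the idea concrete, below is a minimal, illustrative sketch of an STN-style rectifier in PyTorch. It is not the ASTER implementation: ASTER predicts TPS control points, whereas this sketch regresses a simple affine transform, which keeps the code short while showing the same pattern of a localization network predicting transformation parameters that are then applied to the image through a sampling grid. All class and variable names here are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleRectifier(nn.Module):
    """Minimal STN-style rectifier (illustrative only, not ASTER's TPS).
    A small localization network predicts a 2x3 affine matrix, which is
    then used to resample the input image."""
    def __init__(self):
        super().__init__()
        # Localization network: downsample the image and regress parameters.
        self.loc = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, 6)
        # Initialize to the identity transform so training starts stable.
        self.fc.weight.data.zero_()
        self.fc.bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.fc(self.loc(x).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

# Example: rectify a batch of 32x100 text images.
rectifier = SimpleRectifier()
images = torch.randn(4, 3, 32, 100)
rectified = rectifier(images)   # same shape, geometrically transformed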

The recognition network predicts a character sequence directly from the rectified image in an attentional sequence-to-sequence manner. The following figure depicts the structure of the model, which follows the classical sequence-to-sequence architecture consisting of an encoder and a decoder.

Structure of the basic text recognition network.

The sequence-to-sequence model translates the feature sequence output by the encoder into a character sequence output by the decoder. The encoder takes the rectified image as its input and applies a CNN (the ConvNet above) to extract image features. These features are sent to a bidirectional LSTM (BLSTM) that encodes them into a sequence. The decoder transforms the feature sequence output by the encoder into a character sequence (text) of arbitrary length. For more details (including the attention mechanism), you may check the ASTER paper.
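
As an illustration of this encoder structure, here is a minimal CNN + BLSTM sketch in PyTorch. It is not the exact ASTER backbone (which is a deeper ResNet-style network); it only shows how the ConvNet collapses the image height into a horizontal feature sequence that the BLSTM then encodes with context from both directions. Names and layer sizes are assumptions for illustration.

import torch
import torch.nn as nn

class CNNBLSTMEncoder(nn.Module):
    """Minimal CNN + bidirectional-LSTM encoder (illustrative sketch).
    The ConvNet extracts a feature map whose height is collapsed to 1,
    giving a left-to-right feature sequence for the BLSTM to encode."""
    def __init__(self, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),   # collapse height to 1
        )
        self.blstm = nn.LSTM(256, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)

    def forward(self, x):                      # x: (N, 3, 32, 100)
        f = self.cnn(x)                        # (N, 256, 1, W')
        f = f.squeeze(2).permute(0, 2, 1)      # (N, W', 256) feature sequence
        seq, _ = self.blstm(f)                 # (N, W', 2 * hidden)
        return seq

encoder = CNNBLSTMEncoder()
features = encoder(torch.randn(4, 3, 32, 100))  # e.g. shape (4, 25, 512)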

ASTER Code

ASTER is built with PyTorch and the code is available on GitHub. After cloning the repo and installing the required packages, head to the demo section, where you can test the code on your own images. First, download the pre-trained models from the release page. Then edit config.py to set the relevant parameters, e.g. the path to the model, the path to your images, etc. Finally, run the following command from the terminal:

python aster/demo.py

MORAN

MORAN was introduced to read rotated, scaled, and stretched characters in scene text. Similar to ASTER, it consists of two parts: a multi-object rectification network (MORN) that rectifies images, and an attention-based sequence recognition network (ASRN) that reads the text.

Overall structure of MORAN.

In contrast to ASTER, the rectification performed by MORN is not based on a geometric constraint or transformation. Instead, MORN employs convolutional layers (along with pooling, batch normalization, and ReLU) to predict an offset for each part of the image. The next figure shows example rectification results with the computed offsets:

The input images are shown on the left and the resulting rectified images on the right. The heat maps in the middle visualize the computed offset maps.
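
A minimal sketch of this offset-based rectification idea in PyTorch is shown below. It is not the MORN code; it only illustrates the mechanism: a small ConvNet predicts a per-location (dx, dy) offset map, the offsets are added to an identity sampling grid, and grid_sample resamples the input accordingly. All names and layer sizes are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetRectifier(nn.Module):
    """Offset-based rectification in the spirit of MORN (illustrative only).
    A small ConvNet predicts a per-location (dx, dy) offset map that is
    added to an identity sampling grid; grid_sample then resamples the
    input so each part of the image is shifted by its own offset."""
    def __init__(self):
        super().__init__()
        self.offset_net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 2, 3, padding=1),   # 2 channels: dx and dy
        )

    def forward(self, x):
        n, _, h, w = x.shape
        # Predict offsets at low resolution, then upsample to the input size.
        offsets = F.interpolate(self.offset_net(x), size=(h, w),
                                mode='bilinear', align_corners=False)
        offsets = 0.1 * torch.tanh(offsets)          # keep shifts small
        # Identity sampling grid in normalized [-1, 1] coordinates.
        theta = torch.eye(2, 3).unsqueeze(0).repeat(n, 1, 1)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        grid = grid + offsets.permute(0, 2, 3, 1)    # (N, H, W, 2)
        return F.grid_sample(x, grid, align_corners=False)

rectifier = OffsetRectifier()
rectified = rectifier(torch.randn(4, 3, 32, 100))    # same shape as the input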

Similar to ASTER, the structure of the recognition network ASRN is a CNN-BLSTM encoder-decoder framework with an attention mechanism. However, Gated Recurrent Units (GRUs) are used instead of LSTMs to build the decoder. For more details (including the fractional pickup), please check the paper.
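
To illustrate how such an attentional GRU decoder works, here is a single decoding step sketched in PyTorch. It is not the ASRN implementation; it only shows the standard pattern: score the encoder features against the decoder state, form a context vector as their weighted sum, and feed it to a GRU cell to predict the next character. Names, dimensions, and the alphabet size are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnGRUDecoderStep(nn.Module):
    """One step of an attention-based GRU decoder (illustrative sketch).
    At every step the decoder scores all encoder features against its
    hidden state, builds a context vector as their weighted sum, and feeds
    it to a GRU cell to predict the next character."""
    def __init__(self, enc_dim=512, hidden=256, num_classes=38):
        super().__init__()
        self.attn_enc = nn.Linear(enc_dim, hidden)
        self.attn_hid = nn.Linear(hidden, hidden)
        self.attn_score = nn.Linear(hidden, 1)
        self.gru = nn.GRUCell(enc_dim + num_classes, hidden)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, enc_seq, hidden, prev_onehot):
        # enc_seq: (N, T, enc_dim); hidden: (N, hidden); prev_onehot: (N, num_classes)
        scores = self.attn_score(torch.tanh(
            self.attn_enc(enc_seq) + self.attn_hid(hidden).unsqueeze(1)))  # (N, T, 1)
        alpha = F.softmax(scores, dim=1)                   # attention weights
        context = (alpha * enc_seq).sum(dim=1)             # (N, enc_dim)
        hidden = self.gru(torch.cat([context, prev_onehot], dim=1), hidden)
        return self.classifier(hidden), hidden             # logits for the next char

step = AttnGRUDecoderStep()
enc_seq = torch.randn(4, 25, 512)           # e.g. output of the BLSTM encoder
hidden = torch.zeros(4, 256)
prev = torch.zeros(4, 38)                   # one-hot of the previous symbol
logits, hidden = step(enc_seq, hidden, prev)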

MORAN Code

MORAN is also built with PyTorch, and the code is available on GitHub. Follow the instructions on the page to clone the repo and install the requirements. Then download the pretrained model demo.pth to your local drive, and modify model_path and img_path in demo.py before running the following from your terminal:

python demo.py

Conclusions

For regular text recognition, you can use the EasyOCR code explained in the first part of this series. However, recognizing irregular text in natural scenes is challenging due to the large variance of distortions in text shapes and patterns. We have reviewed two techniques that tackle these challenges: ASTER and MORAN. Both methods solve the task with two separate networks: rectification and recognition. The rectification network rectifies the text image before passing the result to the recognition network, which is structured as an encoder-decoder framework with attention. You are encouraged to test the ASTER and MORAN code on several images before deciding which is the best choice for your application.

The next part of this series will handle the more severe cases of text recognition in the wild (also with PyTorch in Python).

References

ASTER paper.

ASTER on GitHub.

MORAN paper.

MORAN on GitHub.
