OCR with Deep Learning in PyTorch (Low Resolution OCR in the Wild)

Part 3: Recognizing Low-Resolution Text Images in the Wild

Mohamed Hasan
5 min read · May 12, 2023

This is the last part of the series explaining Optical Character Recognition (OCR) with deep learning. The first part introduced the topic with a Python implementation, whereas the second part presented more advanced scene text recognition methods.

In this part, we will discuss the challenges of low-resolution text images that are often seen in natural scenes. Scene text comes with arbitrary shapes, uneven illumination, and varied backgrounds, which makes recognition even harder at low resolution.

Introduction

The performance of most OCR methods drops sharply when recognizing low-resolution (LR) text images, because the shapes of LR characters are extremely blurred by optical degradation. The following figure shows examples of such degradation. Try to guess the text inside the images before reading the answer below!

Hard examples of low-resolution text images.

As you can see, it is hard even for a human to recognize the text in such images (2018, CONSTRUCTION, Nutrition, author). It is straightforward, then, to think of boosting the image resolution before text recognition. The line of research working on this problem is called Super Resolution (SR).

Super Resolution aims at generating a high-resolution (HR) image consistent with a given low-resolution input. Traditional approaches employed interpolation filters (e.g. bilinear or bicubic), which produce each HR output pixel by interpolating the colors of its neighbors. Deep learning SR is cast as a regression problem, where the input is the LR image and the output is the HR image. A deep network is trained on pairs of LR inputs and HR targets to minimize some distance metric between the prediction and the ground truth.
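To make this concrete, here is a minimal PyTorch sketch of SR as regression. The tiny network and all its sizes are illustrative choices of mine (this is not TSRN): it upsamples the LR input and is trained to minimize the MSE between its prediction and the HR ground truth.

import torch
import torch.nn as nn

# A minimal sketch of super-resolution as regression (names and sizes are
# illustrative, not the actual TSRN code): a network maps an LR image to an
# HR prediction, and we minimize a pixel-wise distance to the HR target.
class TinySRNet(nn.Module):
    def __init__(self, scale=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 3 * scale * scale, kernel_size=3, padding=1),
            nn.PixelShuffle(scale),   # rearranges channels into a 2x larger image
        )

    def forward(self, lr):
        return self.body(lr)

model = TinySRNet(scale=2)
criterion = nn.MSELoss()                       # distance between prediction and ground truth
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One training step on a dummy LR/HR pair (16x64 LR crops, 32x128 HR targets).
lr_batch = torch.rand(8, 3, 16, 64)
hr_batch = torch.rand(8, 3, 32, 128)
loss = criterion(model(lr_batch), hr_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()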

The question now is how to collect or generate such pairs of LR and HR images. There are two main types of datasets used to train and evaluate SR networks. In the first type, LR images are synthesized from given HR images by down-sampling interpolation or Gaussian blur filtering. This kind of uniform degradation isn't enough to generalize to real LR images in the wild, because the degradation of real blurred scene text is far more varied.
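As an illustration, the snippet below shows how such synthetic pairs are typically produced from an HR crop: an optional Gaussian blur followed by bicubic down-sampling. The kernel size, sigma, and scale factor are arbitrary choices for the example.

import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

# Sketch of synthetic LR generation: blur + bicubic down-sampling.
# This uniform degradation is exactly what fails to cover real-world blur.
def synthesize_lr(hr, scale=2, blur=True):
    # hr: float tensor of shape (3, H, W) in [0, 1]
    if blur:
        hr = TF.gaussian_blur(hr, kernel_size=5, sigma=1.5)
    lr = F.interpolate(hr.unsqueeze(0), scale_factor=1.0 / scale,
                       mode="bicubic", align_corners=False)
    return lr.squeeze(0).clamp(0.0, 1.0)

hr_crop = torch.rand(3, 32, 128)      # dummy HR text crop
lr_crop = synthesize_lr(hr_crop)      # -> (3, 16, 64) synthetic LR crop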

Datasets of the second type are collected by capturing LR-HR image pairs while adjusting the focal length of the camera. This yields real (not synthesized) LR text images, so models trained on such datasets generalize better to real scene challenges.

We will go through one of these challenging datasets (TextZoom) and a model (TSRN) trained to recognize text in such LR images.

TextZoom

TextZoom can be considered the first real text SR dataset. It contains paired LR and HR images of the same text content, captured in the wild by cameras with different focal lengths, which makes it more authentic and challenging than synthetic data. The next figure compares synthetic and real LR text images; it is clear that the real LR images are much more challenging than the synthetic ones.

Comparison between synthetic LR, real LR, and HR images in TextZoom. ‘Syn LR’ denotes the bicubic down-sampled version of the HR image. ‘Real LR’ and ‘HR’ denote LR and HR images captured by cameras with different focal lengths.

TextZoom is well annotated, providing the direction, the text content and the original focal length of each text image. The dataset contains abundant text from different natural scenes, including street views, libraries, shops, vehicle interiors and so on. Text images are divided into three subsets based on difficulty: easy, medium and hard.

Subsets of TextZoom data based on difficulty.

More details of the dataset collection and statistics can be found in the paper.
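If you want to work with paired LR-HR text data in PyTorch, a dataset class can look roughly like the sketch below. The folder layout (lr/, hr/, labels.txt) and file naming are assumptions made purely for this example; TextZoom itself is distributed in its own format described in the repo, so treat this only as an illustration of the idea of a paired dataset.

import os
from PIL import Image
from torch.utils.data import Dataset
import torchvision.transforms.functional as TF

# Illustrative sketch of a paired LR/HR text dataset. The folder layout and
# labels.txt format are assumptions for this example, not TextZoom's format.
class PairedTextSRDataset(Dataset):
    def __init__(self, root, lr_size=(16, 64), hr_size=(32, 128)):
        self.root = root
        self.lr_size, self.hr_size = lr_size, hr_size
        # labels.txt: one "<filename> <text>" pair per line (assumed layout)
        with open(os.path.join(root, "labels.txt")) as f:
            self.samples = [line.strip().split(maxsplit=1) for line in f]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        name, text = self.samples[idx]
        lr = Image.open(os.path.join(self.root, "lr", name)).convert("RGB")
        hr = Image.open(os.path.join(self.root, "hr", name)).convert("RGB")
        # Resize to fixed shapes so batches can be stacked
        lr = TF.to_tensor(TF.resize(lr, self.lr_size))
        hr = TF.to_tensor(TF.resize(hr, self.hr_size))
        return lr, hr, text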

TSRN

Text Super-Resolution Network (TSRN) is a text-oriented, end-to-end model trained to reconstruct LR text images. The input to the model is the LR image, and the output is the super-resolved result. The pipeline of TSRN is shown below:

Overview of TSRN model.

The pipeline can be explained through the following steps:

  • The LR input (RGBM) is the concatenation of the RGB channels with a binary mask.
  • The RGBM input is rectified by the central alignment module (Align). This module solves the problem of misalignment between the paired images.
  • CNN layers then extract shallow features from the rectified image.
  • A stack of five Sequential Residual Blocks (SRBs) extracts deeper, sequence-dependent features. As shown above, each SRB combines convolutional layers (a CNN to extract deeper image features) with recurrent layers (a bi-directional LSTM to extract sequential features), plus a ResNet-style residual skip connection (a simplified sketch follows this list).
  • The SR image is finally generated by an up-sampling block followed by CNN layers.
  • The output is a super-resolved RGB image.
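To give a feel for the SRB mentioned above, here is a simplified PyTorch sketch: convolutions extract spatial features, a bidirectional LSTM scans each row of the feature map as a left-to-right sequence, and a residual connection adds the input back. The channel sizes and the single horizontal pass are simplifications of mine, not the exact TSRN implementation.

import torch
import torch.nn as nn

# Simplified sketch of a Sequential Residual Block (SRB): convolutions for
# spatial features, a bidirectional LSTM across the width for sequential
# features, and a ResNet-style skip connection.
class SequentialResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.PReLU(),
        )
        self.rnn = nn.LSTM(channels, channels // 2,
                           bidirectional=True, batch_first=True)

    def forward(self, x):
        identity = x
        out = self.conv(x)                       # (B, C, H, W)
        b, c, h, w = out.shape
        # Treat each row as a left-to-right sequence of W feature vectors
        seq = out.permute(0, 2, 3, 1).reshape(b * h, w, c)
        seq, _ = self.rnn(seq)                   # bi-LSTM keeps width and channel sizes
        out = seq.reshape(b, h, w, c).permute(0, 3, 1, 2)
        return out + identity                    # residual skip connection

block = SequentialResidualBlock(64)
feat = torch.rand(2, 64, 16, 64)
print(block(feat).shape)   # torch.Size([2, 64, 16, 64])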

The output of the network is supervised by an MSE loss (L2) and a gradient profile loss (LGP), a boundary-aware term designed to reconstruct sharp character boundaries.
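A rough idea of what such a boundary-aware term looks like is sketched below: the horizontal and vertical gradients of the SR and HR images are compared with an L1 distance. The exact gradient-field formulation and the loss weighting used in the paper may differ in detail; this only conveys the intuition.

import torch
import torch.nn.functional as F

# Hedged sketch of a boundary-aware gradient loss: L1 distance between the
# finite-difference gradients of the SR and HR images. Not the paper's exact
# formulation.
def gradient_profile_loss(sr, hr):
    def grads(img):
        dx = img[:, :, :, 1:] - img[:, :, :, :-1]   # horizontal gradients
        dy = img[:, :, 1:, :] - img[:, :, :-1, :]   # vertical gradients
        return dx, dy

    sr_dx, sr_dy = grads(sr)
    hr_dx, hr_dy = grads(hr)
    return F.l1_loss(sr_dx, hr_dx) + F.l1_loss(sr_dy, hr_dy)

sr = torch.rand(4, 3, 32, 128, requires_grad=True)
hr = torch.rand(4, 3, 32, 128)
# The 1e-2 weighting of the two terms is an arbitrary choice for illustration.
total_loss = F.mse_loss(sr, hr) + 1e-2 * gradient_profile_loss(sr, hr)
total_loss.backward()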

Code

The TextZoom dataset and the TSRN code can be accessed from this GitHub repo. Follow the instructions there to install the requirements and download the data and/or pre-trained models. You can easily train and test the TSRN model, in addition to other architectures (CRNN, ASTER, MORAN).

I tested TSRN on the hard images shown in the first figure of this article and got the following results:

Testing TSRN on images from the hard subset of TextZoom.

The left column shows the LR input images, the middle column shows the HR ground-truth images with the ground-truth text above each image, and the right column shows the SR images restored by TSRN, with the predicted text above each image.

If you want to run a demo of TSRN on your images, you need to run the following:

python main.py --demo --demo_dir='./images/'  --resume='your-model.pth' --STN --mask

Conclusions

Recognizing low-resolution text images in the wild is a challenging task, mainly because of the optical degradation of text characters in natural images at such low resolution. It is intuitive, then, to boost the resolution of such images before text recognition. TSRN is a text-oriented, end-to-end model that restores high-resolution text images from which the characters can then be predicted. The model is trained on TextZoom, a dataset of paired LR and HR text images captured in the wild by cameras with different focal lengths. Testing TSRN on the hard examples of TextZoom verifies the effectiveness of this approach, and the results also show its limitations on severely degraded images.

To integrate this framework into your work and get better results, you will need to train or fine-tune these models on your own images. If you have any questions, please write them in the comments.

References

Paper

Code
