Training deep neural network models for accurate recognition of texts in scenes

Bibliographic Details
Main Author: Lim, Joshen Eng Keat
Other Authors: Lu Shijian
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2020
Online Access: https://hdl.handle.net/10356/137977
Institution: Nanyang Technological University
Description
Summary: Scene text recognition has been a research challenge for many years and remains non-trivial owing to the varying conditions in natural scene images. This technology is nevertheless highly significant in many vision-based applications beyond document analysis. In this paper, a state-of-the-art neural network architecture that tackles scene text recognition through image-based sequence recognition is studied and its published results are reproduced. Experiments centre on tuning the model's hyper-parameters in an effort to build the best-performing model, and the model's accuracy is measured against two standard benchmark datasets, namely the IIIT 5k-word and ICDAR13 datasets. Two main refinements were also added to the original implementation: early stopping during model training and fine-tuning of the model. Both enhancements improved the model's performance slightly beyond the published results. Additionally, a program is written to demonstrate the performance and efficiency of the trained text recognition model both in a real-time scenario through a live camera feed and on static images. In the latter scenario, the program can also display the detected texts in the order in which they are meant to be read from the image.
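
The early-stopping refinement described above follows a common pattern: training halts once accuracy on a held-out validation set stops improving for a set number of epochs. The Python sketch below illustrates that pattern only; train_one_epoch, validate, and the patience value are hypothetical stand-ins, not the thesis's actual code.

def train_with_early_stopping(train_one_epoch, validate,
                              max_epochs=100, patience=5):
    # Stop training once validation accuracy fails to improve
    # for `patience` consecutive epochs.
    best_accuracy = 0.0
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        accuracy = validate()
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            epochs_without_improvement = 0
            # In practice, the best-scoring weights would be checkpointed here.
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no recent improvement: stop early
    return best_accuracy

Checkpointing the best-scoring weights inside the improvement branch is what lets a stopped run fall back to its strongest model rather than its final, possibly overfitted, one.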