Accurate and robust detection and recognition of texts in scene

Scene text detection and recognition aim to localize text in natural scene images and output the corresponding character sequences. Automated scene text detection and recognition have attracted increasing interest in the computer vision and deep learning communities due to their wide range of ap...

Bibliographic Details
Main Author: Xue, Chuhui
Other Authors: Lu Shijian
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2022
Subjects: Engineering::Computer science and engineering
Online Access: https://hdl.handle.net/10356/157998
Institution: Nanyang Technological University
id sg-ntu-dr.10356-157998
record_format dspace
spelling sg-ntu-dr.10356-157998 2022-06-03T14:25:12Z
Accurate and robust detection and recognition of texts in scene
Xue, Chuhui
Lu Shijian
School of Computer Science and Engineering
Shijian.Lu@ntu.edu.sg
Engineering::Computer science and engineering
Scene text detection and recognition aim to localize text in natural scene images and output the corresponding character sequences. Automated scene text detection and recognition have attracted increasing interest in the computer vision and deep learning communities due to their wide range of applications in neural machine translation, autonomous driving, etc. Compared with earlier research that focused on the design of hand-crafted features, modern deep-learning-based techniques have achieved significant improvements on scene text detection and recognition tasks. Such frameworks usually deploy convolutional neural networks (CNN), recurrent neural networks (RNN), or Transformers to extract image features for accurate text detection and recognition.
However, automatically detecting and recognizing text in scenes remains challenging due to the complexity of scene text images. First, text in scenes exhibits high variability and diversity in appearance due to the complex patterns of text (e.g., colors, fonts, etc.) and varied environments (e.g., lighting, occlusion, etc.). Second, scene text usually has different lengths, orientations, and shapes, and may suffer from both perspective and curvature distortions. Third, scene images usually have complex backgrounds that may contain patterns similar to text (e.g., trees, traffic signs, etc.). Any of these factors can lead to incorrect predictions in scene text detection and recognition tasks.
In this thesis, we propose several novel techniques for scene text detection and recognition that aim to produce more accurate detection and recognition of scene text in different orientations, lengths, sizes, and shapes. First, we design a novel scene text detection approach that detects text through border semantics awareness and bootstrapping. We introduce a bootstrapping technique that samples multiple 'subsections' of a word or text line and thereby effectively relieves the constraint of limited training data. In addition, we design a semantics-aware text border detection technique that produces four types of text border segments for text detection. Second, we develop a novel multi-scale shape regression network (MSR) for accurate scene text detection. It detects scene text by predicting dense text boundary points instead of sparse quadrilateral vertices, which often suffer from regression errors when dealing with long text lines. Additionally, the multi-scale network extracts and fuses features at different scales concurrently and seamlessly, which demonstrates strong tolerance to text scale variation. Third, we design a mask-guided multi-task network that reliably detects and rectifies scene text of arbitrary shapes. The proposed network detects text keypoints and landmark points for accurate text detection and rectification. Fourth, we propose a novel scene text recognition method, I2C2W, that is tolerant to geometric and photometric degradation by decomposing scene text recognition into two inter-connected tasks and leveraging the advances of the Transformer architecture. Extensive experiments show that the proposed techniques can accurately detect and recognize text with various lengths, orientations, and shapes from natural scene images.
Doctor of Philosophy
2022-05-16T10:16:15Z
2022-05-16T10:16:15Z
2022
Thesis-Doctor of Philosophy
Xue, C. (2022). Accurate and robust detection and recognition of texts in scene. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/157998
https://hdl.handle.net/10356/157998
10.32657/10356/157998
en
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
application/pdf
Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
spellingShingle Engineering::Computer science and engineering
Xue, Chuhui
Accurate and robust detection and recognition of texts in scene
author2 Lu Shijian
author_facet Lu Shijian
Xue, Chuhui
format Thesis-Doctor of Philosophy
author Xue, Chuhui
author_sort Xue, Chuhui
title Accurate and robust detection and recognition of texts in scene
title_sort accurate and robust detection and recognition of texts in scene
publisher Nanyang Technological University
publishDate 2022
url https://hdl.handle.net/10356/157998
_version_ 1735491159158947840