Deep learning methods for scene text detection

Optical Character Recognition (OCR) is a common method for converting typed, handwritten, or printed text in documents and scene photos into a digital format that can be edited, searched, stored, and displayed. It acts as a uniform information entry point for text-related higher-level applications, such as c...

Full description

Bibliographic Details
Main Author: Liu, Zichuan
Other Authors: Goh Wang Ling
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2020
Subjects:
Online Access:https://hdl.handle.net/10356/142533
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-142533
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Electrical and electronic engineering
Engineering::Computer science and engineering::Computer applications
spellingShingle Engineering::Electrical and electronic engineering
Engineering::Computer science and engineering::Computer applications
Liu, Zichuan
Deep learning methods for scene text detection
description Optical Character Recognition (OCR) is a common method for converting typed, handwritten, or printed text in documents and scene photos into a digital format that can be edited, searched, stored, and displayed. It acts as a uniform information entry point for text-related higher-level applications such as cognitive computing, machine translation, text-to-speech, and data mining. OCR is usually divided into two sub-tasks: text detection and recognition. In this thesis, we focus on detecting text in photos captured from natural scenes, a challenging problem that actively draws attention from the machine learning and computer vision communities. Our research started by investigating the deficiencies of existing deep learning-based scene text detection methods. We found that the performance of current Convolutional Neural Network (CNN) based scene text detection systems is mainly restricted by the limited receptive field of CNNs, the geometrical mismatch between text objects and predefined references, and the rigid representation of a text object. The limited receptive field prevents a CNN from fully perceiving texts with large aspect ratios and varying scales. The geometrical mismatch leads to under-trained or over-trained classifiers and bounding-box regressors in a classic anchor-based detection system. The rigid representation reduces the generality of a text detection system in retrieving text instances with various shapes. In this thesis, based on a systematic analysis of the problems mentioned above, we propose two new methods to detect scene texts with arbitrary shapes and orientations. Instead of using a prediction from a single location with a limited receptive field to represent a text's objectness, we propose a new detection framework called Markov Clustering Networks (MCNs), which represents an object with predictions at multiple locations and extracts text objects by Markov Clustering.
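The Markov Clustering step named above is the classic MCL graph-clustering algorithm: alternate expansion (matrix powers) and inflation (elementwise powers with column renormalisation) on a stochastic flow matrix until it converges, then read clusters off the attractor rows. The sketch below shows plain MCL on a toy adjacency matrix; it is only an illustration of the clustering primitive, not the thesis's full MCN pipeline, in which a CNN predicts the flow graph from image features.

```python
import numpy as np

def markov_cluster(adjacency, expansion=2, inflation=2.0, iters=50, tol=1e-6):
    """Plain Markov Clustering (MCL) on a symmetric adjacency matrix."""
    n = len(adjacency)
    M = adjacency.astype(float) + np.eye(n)   # add self-loops
    M /= M.sum(axis=0)                        # make column-stochastic
    for _ in range(iters):
        prev = M.copy()
        M = np.linalg.matrix_power(M, expansion)  # expansion: spread flow
        M = M ** inflation                        # inflation: sharpen flow
        M /= M.sum(axis=0)                        # renormalise columns
        if np.abs(M - prev).max() < tol:          # converged to steady flow
            break
    # Attractor rows (positive diagonal) index the clusters; their
    # nonzero columns are the cluster members.
    clusters = []
    for i in np.where(M.diagonal() > tol)[0]:
        members = frozenset(np.where(M[i] > tol)[0])
        if members not in clusters:
            clusters.append(members)
    return clusters

# Two 3-node cliques joined by a single weak bridge (nodes 2-3)
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])
print(markov_cluster(A))
```

On this toy graph the inflation step suppresses the weak bridge, so the two cliques end up in separate clusters; in the MCN setting this is what lets local per-location predictions coalesce into distinct text instances.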
Different from traditional top-down text detection methods, our MCN can be treated as a bottom-up method: it iteratively aggregates local predictions into a global object representation, which is robust to texts with large aspect ratios and highly varying orientations. Moreover, this reference-less framework does not suffer from the performance degradation caused by mismatching, and thus enjoys much better generality. We further enhance the capability of MCN to capture long texts by introducing a 2-Dimensional Position-aware Attention mechanism.

In addition to detecting rectangular text objects, we go a step further and design a method to retrieve curved text objects. The main challenge of curved text detection comes from irregular shapes and highly varying orientations. On the one hand, the bounding-box representation of a text object does not scale well to the curved scenario: it fails to tightly represent the boundaries of a curved text and results in low recall and precision. On the other hand, irregular shapes and complex layouts introduce additional noise into the representations extracted by region-based methods, reducing the performance of polygon regression and instance segmentation. To address these problems, we propose a novel Conditional Spatial Expansion (CSE) mechanism to detect text with arbitrary shapes. We explicitly model a curved text object as a 2-dimensional sequence and regard retrieving an instance-level curved text region as a conditional prediction problem in the spatial domain. Starting from an arbitrary interior point of a text region, CSE progressively predicts the status of its neighborhoods, based on which the related sub-regions are merged into the entire instance region. Our CSE is highly discriminative, especially when texts are close to each other, and provides a controllable approach to extracting an expected text region with minimal post-processing effort.
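The expansion process described above can be pictured as a seeded region-growing loop: start from an interior point, then repeatedly decide whether each neighbouring sub-region belongs to the same instance and merge it if so. The sketch below is a deliberately simplified analogue in which a precomputed per-pixel score map stands in for the network's conditional merge predictor (the thesis's actual CSE conditions each decision on the region grown so far); the `score_map` and `threshold` here are illustrative assumptions.

```python
from collections import deque

import numpy as np

def expand_region(score_map, seed, threshold=0.5):
    """Grow an instance region outward from an interior seed point,
    merging 4-connected neighbours whose text score passes a threshold."""
    h, w = score_map.shape
    region = {seed}          # cells merged into the instance so far
    frontier = deque([seed]) # cells whose neighbours are still unexplored
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < h and 0 <= nc < w
                    and (nr, nc) not in region
                    and score_map[nr, nc] >= threshold):
                region.add((nr, nc))
                frontier.append((nr, nc))
    return region

# Toy score map: a text blob in the top-left, a separate blob at (2, 2)
scores = np.array([
    [0.9, 0.8, 0.1],
    [0.7, 0.9, 0.1],
    [0.1, 0.1, 0.9],
])
print(sorted(expand_region(scores, (0, 0))))
```

Note that the disconnected high-score cell at (2, 2) is never merged: expansion from a seed naturally yields a single instance region, which is what makes this formulation discriminative when text instances lie close together.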
author2 Goh Wang Ling
author_facet Goh Wang Ling
Liu, Zichuan
format Thesis-Doctor of Philosophy
author Liu, Zichuan
author_sort Liu, Zichuan
title Deep learning methods for scene text detection
title_short Deep learning methods for scene text detection
title_full Deep learning methods for scene text detection
title_fullStr Deep learning methods for scene text detection
title_full_unstemmed Deep learning methods for scene text detection
title_sort deep learning methods for scene text detection
publisher Nanyang Technological University
publishDate 2020
url https://hdl.handle.net/10356/142533
_version_ 1772827265482096640
spelling sg-ntu-dr.10356-142533 2023-07-04T17:16:41Z Deep learning methods for scene text detection Liu, Zichuan Goh Wang Ling School of Electrical and Electronic Engineering Lin Guosheng EWLGOH@ntu.edu.sg Engineering::Electrical and electronic engineering Engineering::Computer science and engineering::Computer applications Doctor of Philosophy 2020-06-24T01:19:11Z 2020-06-24T01:19:11Z 2020 Thesis-Doctor of Philosophy Liu, Z. (2020). Deep learning methods for scene text detection. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/142533 10.32657/10356/142533 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University