Deep learning methods for scene text detection

Optical Character Recognition (OCR) is a common method for converting typed, handwritten, or printed text in documents and scene photos into a digital format that can be edited, searched, stored, and displayed. It acts as a uniform information entry point for text-related higher-level applications, such as c...

Full description

Bibliographic Details
Main Author: Liu, Zichuan
Other Authors: Goh Wang Ling
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2020
Subjects:
Online Access:https://hdl.handle.net/10356/142533
Institution: Nanyang Technological University
Language: English
id sg-ntu-dr.10356-142533
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Electrical and electronic engineering
Engineering::Computer science and engineering::Computer applications
spellingShingle Engineering::Electrical and electronic engineering
Engineering::Computer science and engineering::Computer applications
Liu, Zichuan
Deep learning methods for scene text detection
description Optical Character Recognition (OCR) is a common method for converting typed, handwritten, or printed text in documents and scene photos into a digital format that can be edited, searched, stored, and displayed. It acts as a uniform information entry point for text-related higher-level applications such as cognitive computing, machine translation, text-to-speech, and data mining. OCR is usually divided into two sub-tasks: text detection and recognition. In this thesis, we focus on detecting text in photos captured from natural scenes, a challenging problem that actively draws attention from the machine learning and computer vision communities. Our research started by investigating the deficiencies of existing deep learning-based scene text detection methods. We found that the performance of current Convolutional Neural Network (CNN) based scene text detection systems is mainly restricted by the limited receptive field of CNNs, the geometrical mismatch between text objects and predefined references, and the rigid representation of a text object. The limited receptive field prevents a CNN from fully perceiving texts with large aspect ratios and varying scales. The geometrical mismatch leads to under-trained or over-trained classifiers and bounding-box regressors in a classic anchor-based detection system. The rigid representation reduces the generality of a text detection system in retrieving text instances with various shapes. In this thesis, based on a systematic analysis of the problems mentioned above, we propose two new methods to detect scene texts with arbitrary shapes and orientations. Instead of using a prediction from a single location with a limited receptive field to represent a text's objectness, we propose a new detection framework called Markov Clustering Networks (MCNs), which represents an object with predictions at multiple locations and extracts text objects by Markov Clustering.
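The Markov Clustering step named above is the classic MCL graph-clustering algorithm: alternate expansion (matrix powers) and inflation (elementwise powers with column renormalisation) on a stochastic flow matrix until it converges, then read clusters off the attractor rows. The sketch below shows plain MCL on a toy adjacency matrix; it is only an illustration of the clustering primitive, not the thesis's full MCN pipeline, in which a CNN predicts the flow graph from image features.

```python
import numpy as np

def markov_cluster(adjacency, expansion=2, inflation=2.0, iters=50, tol=1e-6):
    """Plain Markov Clustering (MCL) on a symmetric adjacency matrix."""
    n = len(adjacency)
    M = adjacency.astype(float) + np.eye(n)   # add self-loops
    M /= M.sum(axis=0)                        # make column-stochastic
    for _ in range(iters):
        prev = M.copy()
        M = np.linalg.matrix_power(M, expansion)  # expansion: spread flow
        M = M ** inflation                        # inflation: sharpen flow
        M /= M.sum(axis=0)                        # renormalise columns
        if np.abs(M - prev).max() < tol:          # converged to steady flow
            break
    # Attractor rows (positive diagonal) index the clusters; their
    # nonzero columns are the cluster members.
    clusters = []
    for i in np.where(M.diagonal() > tol)[0]:
        members = frozenset(np.where(M[i] > tol)[0])
        if members not in clusters:
            clusters.append(members)
    return clusters

# Two 3-node cliques joined by a single weak bridge (nodes 2-3)
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])
print(markov_cluster(A))
```

On this toy graph the inflation step suppresses the weak bridge, so the two cliques end up in separate clusters; in the MCN setting this is what lets local per-location predictions coalesce into distinct text instances.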
Different from traditional top-down text detection methods, our MCN can be treated as a bottom-up method: it iteratively aggregates local predictions into a global object representation, which is robust to texts with large aspect ratios and highly varying orientations. Moreover, this reference-less framework does not suffer from the performance degradation caused by mismatching, and thus enjoys much better generality. We further enhance the capability of MCN to capture long texts by introducing a 2-Dimensional Position-aware Attention mechanism.

In addition to detecting rectangular text objects, we go a step further and design a method to retrieve curved text objects. The main challenge of curved text detection comes from irregular shapes and highly varying orientations. On the one hand, the bounding-box representation of a text object does not scale well to the curved scenario: it fails to tightly represent the boundaries of a curved text and results in low recall and precision. On the other hand, irregular shapes and complex layouts introduce additional noise into the representations extracted by region-based methods, reducing the performance of polygon regression and instance segmentation. To address these problems, we propose a novel Conditional Spatial Expansion (CSE) mechanism to detect text with arbitrary shapes. We explicitly model a curved text object as a 2-dimensional sequence and regard retrieving an instance-level curved text region as a conditional prediction problem in the spatial domain. Starting from an arbitrary interior point of a text region, CSE progressively predicts the status of its neighborhoods, based on which the related sub-regions are merged into the entire instance region. Our CSE is highly discriminative, especially when texts are close to each other, and provides a controllable approach to extracting an expected text region with minimal post-processing effort.
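The expansion process described above can be pictured as a seeded region-growing loop: start from an interior point, then repeatedly decide whether each neighbouring sub-region belongs to the same instance and merge it if so. The sketch below is a deliberately simplified analogue in which a precomputed per-pixel score map stands in for the network's conditional merge predictor (the thesis's actual CSE conditions each decision on the region grown so far); the `score_map` and `threshold` here are illustrative assumptions.

```python
from collections import deque

import numpy as np

def expand_region(score_map, seed, threshold=0.5):
    """Grow an instance region outward from an interior seed point,
    merging 4-connected neighbours whose text score passes a threshold."""
    h, w = score_map.shape
    region = {seed}          # cells merged into the instance so far
    frontier = deque([seed]) # cells whose neighbours are still unexplored
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < h and 0 <= nc < w
                    and (nr, nc) not in region
                    and score_map[nr, nc] >= threshold):
                region.add((nr, nc))
                frontier.append((nr, nc))
    return region

# Toy score map: a text blob in the top-left, a separate blob at (2, 2)
scores = np.array([
    [0.9, 0.8, 0.1],
    [0.7, 0.9, 0.1],
    [0.1, 0.1, 0.9],
])
print(sorted(expand_region(scores, (0, 0))))
```

Note that the disconnected high-score cell at (2, 2) is never merged: expansion from a seed naturally yields a single instance region, which is what makes this formulation discriminative when text instances lie close together.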
author2 Goh Wang Ling
author_facet Goh Wang Ling
Liu, Zichuan
format Thesis-Doctor of Philosophy
author Liu, Zichuan
author_sort Liu, Zichuan
title Deep learning methods for scene text detection
title_short Deep learning methods for scene text detection
title_full Deep learning methods for scene text detection
title_fullStr Deep learning methods for scene text detection
title_full_unstemmed Deep learning methods for scene text detection
title_sort deep learning methods for scene text detection
publisher Nanyang Technological University
publishDate 2020
url https://hdl.handle.net/10356/142533
_version_ 1772827265482096640
spelling sg-ntu-dr.10356-142533 2023-07-04T17:16:41Z Deep learning methods for scene text detection Liu, Zichuan Goh Wang Ling School of Electrical and Electronic Engineering Lin Guosheng EWLGOH@ntu.edu.sg Engineering::Electrical and electronic engineering Engineering::Computer science and engineering::Computer applications Doctor of Philosophy 2020-06-24T01:19:11Z 2020-06-24T01:19:11Z 2020 Thesis-Doctor of Philosophy Liu, Z. (2020). Deep learning methods for scene text detection. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/142533 10.32657/10356/142533 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University