Deep learning methods for scene text detection
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2020
Online Access: https://hdl.handle.net/10356/142533
Institution: Nanyang Technological University
Summary: Optical Character Recognition (OCR) is a common method to convert typed, handwritten or printed text from documents and scene photos into a digital format that can be edited, searched, stored and displayed. It acts as a uniform information entry point for text-related higher-level applications, such as cognitive computing, machine translation, text-to-speech, and data mining.
OCR is usually divided into two sub-tasks: text detection and recognition. In this thesis, we focus on detecting text in photos captured from natural scenes, which is challenging and actively draws attention from the machine learning and computer vision communities. Our research started by investigating the deficiencies of existing deep learning-based scene text detection methods. We found that the performance of current Convolutional Neural Network (CNN) based scene text detection systems is mainly restricted by the limited receptive field of CNNs, the geometrical mismatch between text objects and predefined references, and the rigid representation of a text object. The limited receptive field prevents a CNN from fully perceiving texts with large aspect ratios and varying scales. The geometrical mismatch leads to under-trained or over-trained classifiers and bounding box regressors in a classic anchor-based detection system. The rigid representation reduces the generality of a text detection system in retrieving text instances with various shapes. In this thesis, based on a systematic analysis of the problems mentioned above, we propose two new methods to detect scene texts with arbitrary shapes and orientations.
Instead of using a prediction from a single location with a limited receptive field to represent a text's objectness, we propose a new detection framework called Markov Clustering Networks (MCNs), which represents an object with predictions at multiple locations and extracts text objects by Markov Clustering. Different from traditional top-down text detection methods, our MCN can be treated as a bottom-up method. It iteratively aggregates local predictions into a global object representation, which is robust to texts with large aspect ratios and highly varying orientations. Moreover, this reference-less framework does not suffer from the performance degradation caused by mismatching and thus enjoys much better generality. We further enhance the capability of MCN in capturing long texts by introducing a 2-Dimensional Position-aware Attention mechanism.
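The grouping step can be illustrated with a generic Markov Clustering (MCL) routine. The sketch below is only a plain MCL implementation over an adjacency matrix of local "flow" affinities, not the procedure trained end-to-end in MCN; the function name `markov_clustering`, the way the adjacency matrix is obtained, and all parameter values are illustrative assumptions.

```python
import numpy as np

def markov_clustering(adjacency, expansion=2, inflation=2.0, iterations=50, eps=1e-6):
    """Plain Markov Clustering (MCL) over a graph of local predictions.

    `adjacency` is an (N, N) non-negative matrix whose entry (i, j) encodes how
    strongly location j "flows" toward location i (in the MCN setting this would
    be derived from the network's local flow predictions; here it is an input).
    Returns a list of clusters, each a list of node indices.
    """
    # Add self-loops and column-normalize to obtain a stochastic matrix.
    m = adjacency.astype(float) + np.eye(adjacency.shape[0])
    m = m / m.sum(axis=0, keepdims=True)

    for _ in range(iterations):
        prev = m
        m = np.linalg.matrix_power(m, expansion)   # expansion: spread flow along the graph
        m = np.power(m, inflation)                 # inflation: sharpen strong flows
        m = m / m.sum(axis=0, keepdims=True)       # renormalize columns
        if np.abs(m - prev).max() < eps:           # stop when flow has stabilized
            break

    # Rows with non-zero mass act as attractors; each attractor's non-zero
    # columns form one cluster (one text instance in the MCN interpretation).
    clusters, seen = [], set()
    for row in m:
        members = tuple(np.flatnonzero(row > eps))
        if members and members not in seen:
            seen.add(members)
            clusters.append(list(members))
    return clusters
```

In this reading, each node is a spatial location of the feature map, the adjacency encodes predicted flow between neighboring locations, and each recovered cluster corresponds to one text instance.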
In addition to detecting rectangular text objects, we go a step further and design a method to retrieve curved text objects. The main challenge of curved text detection comes from irregular shapes and highly varying orientations. On the one hand, the bounding box representation of a text object does not scale well to the curved scenario: it fails to tightly represent the boundaries of a curved text and results in low recall and precision. On the other hand, irregular shapes and complex layouts introduce additional noise into the representations extracted by region-based methods, reducing the performance of polygon regression and instance segmentation. To address these problems, we propose a novel Conditional Spatial Expansion (CSE) mechanism to detect text with arbitrary shapes. We explicitly model a curved text object as a 2-dimensional sequence and regard retrieving an instance-level curved text region as a conditional prediction problem in the spatial domain. Starting from an arbitrary interior point of a text region, CSE progressively predicts the status of its neighborhoods, based on which the related sub-regions are merged into the entire instance region. Our CSE is highly discriminative, especially when texts are close to each other, and provides a controllable approach to extracting an expected text region with minimal post-processing effort.
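The expand-and-merge control flow can be conveyed with a breadth-first region-growing sketch. This is only a conceptual stand-in: the real CSE operates on learned features and conditional predictions from a trained model, whereas the `merge_prob` map, the seed point and the threshold below are illustrative assumptions.

```python
from collections import deque
import numpy as np

def conditional_spatial_expansion(merge_prob, seed, threshold=0.5):
    """Region growing in the spirit of CSE.

    Starting from an interior `seed` cell, progressively decide for each
    4-connected neighbour whether it should be merged into the current text
    instance. `merge_prob` is an (H, W) map of scores in [0, 1], a stand-in
    for the conditional merge predictions a trained CSE model would produce.
    Returns a boolean (H, W) mask of the recovered instance region.
    """
    h, w = merge_prob.shape
    mask = np.zeros((h, w), dtype=bool)
    queue = deque([seed])
    mask[seed] = True

    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):   # 4-neighbourhood
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc]:
                # Conditional step: the decision for (nr, nc) is made only once
                # the already-expanded region reaches it from (r, c).
                if merge_prob[nr, nc] >= threshold:
                    mask[nr, nc] = True
                    queue.append((nr, nc))
    return mask
```

Because expansion stops wherever the conditional score falls below the threshold, adjacent but distinct text instances are kept separate, which mirrors the discriminative behavior described above.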