Training deep network models for accurate detection of texts in scenes

Bibliographic Details
Main Author: Lee, Chun Fei
Other Authors: Lu Shijian
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2023
Online Access: https://hdl.handle.net/10356/165885
Institution: Nanyang Technological University
Description
Summary: Deep neural network-based approaches to scene text detection have attracted significant attention for their strong results on benchmark datasets, including ICDAR 2013 [52], ICDAR 2015 [34], and MSRA-TD500 [53]. However, several existing methods employ complex pipelines with multiple intermediate steps, such as text-line candidate generation [29], rule-based filtering [29], and word partitioning [7, 29]. These steps increase computational cost and processing time, which reduces efficiency and degrades performance [49]. Vanishing gradients and overfitting also pose significant challenges for scene text detection methods [1, 50], and low-resolution feature maps struggle to capture small, barely noticeable text in an image [51]. This research aims to address these challenges and enhance existing scene text detection models through four key improvements. First, we refactor a widely used scene text detection method [1] and adapt its simple yet efficient pipeline to serve as the basis for further enhancements. Second, we incorporate skip links into its feature extractor to counter the vanishing gradient problem. Third, we apply the Feature Pyramid Network (FPN) [2] to address the low-resolution issue: we up-sample feature maps at different scales and concatenate them to form a high-resolution feature map. Lastly, we fine-tune the training schedule to avoid overfitting, using fine-grained learning rates so that the model progresses from easier concepts to more complex ones. Through extensive experimentation, we demonstrate the robustness and effectiveness of our method, which outperforms the original EAST implementation on the ICDAR 2015 dataset by 5.71 points, achieving an F-score of 82.12.
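The reported result can be unpacked with a little arithmetic. The sketch below assumes the F-score is the usual harmonic mean of precision and recall, expressed in percentage points; the summary gives no precision or recall figures, so the pair used here is hypothetical and only illustrates the formula. The baseline value is simply the stated score minus the stated gain.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures stated in the summary (assumed to be percentage points).
reported_f = 82.12
reported_gain = 5.71

# Baseline score implied by the two figures above.
implied_baseline_f = round(reported_f - reported_gain, 2)

# Hypothetical precision/recall pair, for illustration only;
# these values do not come from the thesis.
example_f = f_score(0.80, 0.90)

print(implied_baseline_f)
print(round(example_f, 4))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a detector cannot reach a high F-score by improving precision or recall alone.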
For more implementation details, please check our code at: https://github.com/ChunFei96/EAST_resnet50.
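Independently of that repository, the up-sample-and-concatenate step described in the summary can be sketched in a few lines of plain Python. This is a toy illustration and not the author's implementation: the nearest-neighbour interpolation, the nested-list feature maps, and the names `upsample_nearest` and `fuse` are all assumptions made for this example, and the learned convolutions of a real network are omitted.

```python
from typing import List

Channel = List[List[float]]   # one 2-D map: rows x cols
FeatureMap = List[Channel]    # channels x rows x cols


def upsample_nearest(fmap: FeatureMap, factor: int) -> FeatureMap:
    """Repeat every value `factor` times along both spatial axes."""
    out: FeatureMap = []
    for channel in fmap:
        rows: Channel = []
        for row in channel:
            wide = [v for v in row for _ in range(factor)]
            rows.extend(list(wide) for _ in range(factor))
        out.append(rows)
    return out


def fuse(fine: FeatureMap, coarse: FeatureMap) -> FeatureMap:
    """Up-sample `coarse` to the size of `fine`, then concatenate channels."""
    factor = len(fine[0]) // len(coarse[0])
    up = upsample_nearest(coarse, factor)
    if len(up[0]) != len(fine[0]) or len(up[0][0]) != len(fine[0][0]):
        raise ValueError("spatial sizes must match after up-sampling")
    return fine + up  # channel-wise concatenation


# Hypothetical inputs: a 1-channel 4x4 map and a 2-channel 2x2 map.
fine = [[[1.0, 2.0, 3.0, 4.0] for _ in range(4)]]
coarse = [
    [[5.0, 6.0], [7.0, 8.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]

fused = fuse(fine, coarse)
print(len(fused), len(fused[0]), len(fused[0][0]))
```

The fused map keeps the spatial resolution of the finer input while carrying the channels of both, which is the property the summary relies on for detecting small text.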