The old newspaper project

Bibliographic Details
Main Author: Mao, Junke
Other Authors: Ling Keck Voon
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2022
Online Access: https://hdl.handle.net/10356/157550
Institution: Nanyang Technological University
Description
Summary: Optical Character Recognition (OCR) is now commonly used to convert printouts and documents in sociology, communication, and education studies. Traditional OCR models extract text sequentially across the whole page. In a newspaper, however, text is arranged in columns by article, with images embedded. As a result, a complex layout with multi-column text, headlines, embedded figures, and similar elements might impair the OCR results. To improve the efficiency of converting newspaper images, we built a specialized model for newspaper recognition. The integrated model performs object segmentation to extract the relevant components of the image (the headlines, embedded figures, etc.) and then performs OCR on each of these components. The output is a text document logically arranged with headlines, the text body in a single column, and embedded images appended at the end.
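
The output arrangement described in the summary can be illustrated with a minimal Python sketch. It is not the project's implementation: the `Region` record, its labels ("headline", "body", "figure"), the `column_width` parameter, and the `[figure: ...]` placeholder format are all hypothetical, and the segmentation and OCR steps are assumed to have already produced the regions. The sketch only shows one plausible way to order them into headlines, single-column body text, and trailing figures.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    """One segmented component of a newspaper page (hypothetical structure)."""
    kind: str                        # assumed labels: "headline", "body", "figure"
    x: int                           # left edge of the bounding box, in pixels
    y: int                           # top edge of the bounding box, in pixels
    text: Optional[str] = None       # OCR result for text regions
    image_ref: Optional[str] = None  # placeholder identifier for a cropped figure


def assemble_article(regions: List[Region], column_width: int) -> str:
    """Order regions as headlines, single-column body text, then figures."""
    headlines = sorted(
        (r for r in regions if r.kind == "headline"),
        key=lambda r: (r.y, r.x),
    )
    # Bucket body blocks into columns by x position, then read each
    # column top to bottom so multi-column text becomes one column.
    body = sorted(
        (r for r in regions if r.kind == "body"),
        key=lambda r: (r.x // column_width, r.y),
    )
    figures = sorted(
        (r for r in regions if r.kind == "figure"),
        key=lambda r: (r.y, r.x),
    )

    lines = [r.text or "" for r in headlines]
    lines += [r.text or "" for r in body]
    # Figures are appended at the end, as the summary describes.
    lines += [f"[figure: {r.image_ref}]" for r in figures]
    return "\n".join(lines)
```

Bucketing by a fixed `column_width` is a simplifying assumption; real newspaper pages with irregular columns would likely need the column boundaries to come from the segmentation step itself.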