POST-OCR ERROR DETECTION USING CLASSIFICATION BY COMPRESSION FOR FORM TEXT DATA

A main problem in post-OCR error detection is its weakness at detecting errors in unique patterns that do not depend on a particular language. No focused research has yet addressed this: regex and lexicon approaches lack adaptability to diverse documents, and Deep Learni...

Bibliographic Details
Main Author:	Ikhsan Saputro, Muhammad
Format:	Theses
Language:	Indonesia
Online Access:	https://digilib.itb.ac.id/gdl/view/79561
Institution:	Institut Teknologi Bandung

id	id-itb.:79561
spelling	id-itb.:79561 2024-01-10T08:18:22Z OCR, Post OCR Error Detection, NCD, gzip, zstd, LZ78, short text, FUNSD+
institution	Institut Teknologi Bandung
building	Institut Teknologi Bandung Library
continent	Asia
country	Indonesia
content_provider	Institut Teknologi Bandung
collection	Digital ITB
language	Indonesia
description	A main problem in post-OCR error detection is its weakness at detecting errors in unique patterns that do not depend on a particular language. No focused research has yet addressed this: regex and lexicon approaches lack adaptability to diverse documents, and deep learning methods take a long time to train and consume substantial computational resources. NCD (gzip) kNN offers an alternative that is comparable to deep learning methods such as BERT, without training and without high computational cost. Applying NCD (gzip) kNN to error classification on short text, which is typical of form data, resulted in a significant drop in accuracy compared with long text, because both short text and the error classification task are difficult for NCD (gzip) to handle. We therefore tried another NCD approach using a zstd dictionary. This method reduced prediction time from 3.5 s to 0.015 ms, but its accuracy is no better than that of NCD (gzip) kNN. A custom LZ78 compression approach was then built for the sole purpose of distinguishing error text from non-error text. It raised accuracy to 0.69 on the short-text dataset, although this is still below BERT’s accuracy. FUNSD+ is used to demonstrate the custom LZ78 compression on real form data, where it reaches an accuracy of 0.745 and a precision of 0.85; for an OCR accuracy of 0.511, this method could raise that figure to 0.93. The custom LZ78 compression offers post-OCR error detection for short text from form data with low computational resources and a training time of 0.5 s per label.
format	Theses
author	Ikhsan Saputro, Muhammad
title	POST-OCR ERROR DETECTION USING CLASSIFICATION BY COMPRESSION FOR FORM TEXT DATA
url	https://digilib.itb.ac.id/gdl/view/79561
_version_	1822996351797952512
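
The NCD (gzip) kNN baseline named in the description can be illustrated with a minimal sketch. It assumes the standard normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the gzip-compressed length. The function names, the toy training strings, the labels and the value of k are hypothetical and are not taken from the thesis.

```python
import gzip
from collections import Counter


def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings using gzip."""
    cx, cy = clen(x), clen(y)
    cxy = clen(x + " " + y)  # compress the concatenation of both strings
    return (cxy - min(cx, cy)) / max(cx, cy)


def knn_predict(query: str, train: list[tuple[str, str]], k: int = 3) -> str:
    """Return the majority label among the k training items nearest to query."""
    nearest = sorted(train, key=lambda item: ncd(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical short form-field strings, labelled by hand for illustration.
    train = [
        ("Invoice No: 10442", "ok"),
        ("Date: 2021-03-14", "ok"),
        ("Total Amount: 1,250.00", "ok"),
        ("lnv0ice N0: l0442", "error"),
        ("Dale: 2O2l-O3-l4", "error"),
        ("T0tal Am0unt: l,25O.OO", "error"),
    ]
    print(knn_predict("Invoice No: 20881", train, k=3))
```

No training step is involved: every prediction compresses the query against each stored example. On strings this short the gzip header dominates the compressed length, which is in line with the accuracy drop on short text that the description reports.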

Similar Items