POST-OCR ERROR DETECTION USING CLASSIFICATION BY COMPRESSION FOR FORM TEXT DATA
One of the main problems in post-OCR error detection is its weakness in detecting errors in unique patterns that do not depend on a particular language. No research has yet focused on this problem: regex and lexicon approaches adapt poorly to diverse documents, and Deep Learni...
Main Author: | Ikhsan Saputro, Muhammad |
---|---|
Format: | Theses |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/79561 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:79561 |
---|---|
spelling |
id-itb.:79561 (2024-01-10T08:18:22Z); keywords: OCR, Post OCR Error Detection, NCD, gzip, zstd, LZ78, short text, FUNSD+ |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesian |
description |
One of the main problems in post-OCR error detection is its weakness in detecting errors in
unique patterns that do not depend on a particular language. No research has yet focused on
this problem: regex and lexicon approaches adapt poorly to diverse documents, and deep
learning methods take a long time to train and consume substantial computational resources.
NCD (gzip) kNN offers an alternative that is comparable to deep learning methods such as
BERT, without training and without high computational cost. Applying NCD (gzip) kNN to
error classification on the short text typically found in form data resulted in a significant
drop in accuracy compared with long text, because both short text and the error
classification task are difficult to handle with NCD (gzip). We therefore tried another NCD
approach using a zstd dictionary. This method reduced prediction time from 3.5 s to
0.015 ms, but its accuracy was no better than that of NCD (gzip) kNN. A custom LZ78
compression approach was then built for the sole purpose of detecting the difference between
error and non-error text. This method raised accuracy to 0.69 on the short-text dataset,
although that is still below BERT's accuracy. FUNSD+ was used to demonstrate the custom
LZ78 compression on real form data, where it achieved an accuracy of 0.745 and a precision
of 0.85; from an OCR accuracy of 0.511, this method could raise that figure to 0.93. The
custom LZ78 compression offers post-OCR error detection on short text from form data with
low computational resources and a training time of 0.5 s per label. |
format |
Theses |
author |
Ikhsan Saputro, Muhammad |
title |
POST-OCR ERROR DETECTION USING CLASSIFICATION BY COMPRESSION FOR FORM TEXT DATA |
url |
https://digilib.itb.ac.id/gdl/view/79561 |
_version_ |
1822996351797952512 |
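The description names NCD (gzip) kNN as the training-free baseline but does not reproduce its implementation. The following is a minimal sketch of the commonly used formulation, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the gzip-compressed length. The function names `clen`, `ncd`, and `knn_predict`, the space used to join the two strings, and the toy training data are assumptions made for illustration; they are not taken from the thesis.

```python
import gzip
from collections import Counter


def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between x and y using gzip.

    Joining with a single space is an assumption of this sketch.
    """
    cx, cy = clen(x), clen(y)
    cxy = clen(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def knn_predict(query: str, train: list[tuple[str, str]], k: int = 3) -> str:
    """Return the majority label among the k training texts nearest to query."""
    ranked = sorted(train, key=lambda pair: ncd(query, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical toy data, invented for this example only.
    train = [
        ("Invoice Number", "clean"),
        ("Date of Birth", "clean"),
        ("Total Amount", "clean"),
        ("1nv0ice Numb3r", "error"),
        ("Dat3 0f B1rth", "error"),
        ("T0tal Am0unt", "error"),
    ]
    print(knn_predict("Inv0ice Date", train, k=3))
```

On strings this short, the fixed gzip header and trailer make up a large share of every compressed length, so distances between different pairs end up close together. That is consistent with the accuracy drop on short form text reported in the description.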