POST-OCR ERROR DETECTION USING CLASSIFICATION BY COMPRESSION FOR FORM TEXT DATA

Bibliographic Details
Main Author: Ikhsan Saputro, Muhammad
Format: Theses
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/79561
Institution: Institut Teknologi Bandung
Description
Summary: One of the main problems in post-OCR error detection is its weakness at detecting errors in unique patterns that do not depend on a particular language. No research has yet focused on this problem: regex and lexicon approaches adapt poorly to diverse documents, while deep learning methods take a long time to train and consume substantial computational resources. NCD (gzip) kNN, which pairs the normalized compression distance computed with gzip with a k-nearest-neighbor classifier, offers an alternative that is comparable to deep learning methods such as BERT without training or high computational cost. Applying NCD (gzip) kNN to error classification on the short text typical of form data, however, yields significantly lower accuracy than on long text, because short text and the error classification task are both difficult for NCD (gzip) to handle. We therefore tried another NCD approach based on a zstd dictionary. This method reduced prediction time from 3.5 s to 0.015 ms, but its accuracy was no better than that of NCD (gzip) kNN. A custom LZ78 compression approach was then built solely to detect the difference between erroneous and error-free text. It raised accuracy on the short-text dataset to 0.69, although this is still below BERT's accuracy. FUNSD+ is used to demonstrate the custom LZ78 compression on real form data, where it achieves an accuracy of 0.745 and a precision of 0.85; where the OCR accuracy is 0.511, this method could raise that figure to 0.93. Custom LZ78 compression thus offers post-OCR error detection for short text from form data with low computational resources and a training time of 0.5 s per label.
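The NCD (gzip) kNN baseline named in the summary can be illustrated with a minimal sketch. It assumes the commonly used definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the gzip-compressed length, followed by a majority vote over the k nearest training samples. The function names, the labels "clean" and "error", and the toy training strings are hypothetical and are not taken from the thesis; the thesis's own preprocessing, datasets, and its zstd and LZ78 variants are not reproduced here.

```python
import gzip
from collections import Counter


def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings (gzip-based)."""
    cx, cy = clen(x), clen(y)
    cxy = clen(x + " " + y)  # compress the concatenation of both strings
    return (cxy - min(cx, cy)) / max(cx, cy)


def knn_predict(query: str, train: list[tuple[str, str]], k: int = 3) -> str:
    """Label the query by majority vote among its k nearest training samples."""
    nearest = sorted(train, key=lambda item: ncd(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: short form-field strings with invented labels.
TRAIN = [
    ("Invoice No: 10234", "clean"),
    ("Date: 12/03/2021", "clean"),
    ("Total Amount: 450.00", "clean"),
    ("lnvoice N0: 1O234", "error"),
    ("Dat3: l2/O3/2O21", "error"),
    ("T0tal Am0unt: 45O.OO", "error"),
]

if __name__ == "__main__":
    print(knn_predict("Invoice No: 55871", TRAIN, k=3))
```

On strings this short, the fixed gzip header and trailer make up most of the compressed length, which fits the summary's observation that accuracy drops on short form text compared with long text.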