Semi-supervised learning and bidirectional decoding for effective grammar correction in low-resource scenarios

The correction of grammatical errors in natural language processing is a crucial task as it aims to enhance the accuracy and intelligibility of written language. However, developing a grammatical error correction (GEC) framework for low-resource languages presents significant challenges due to the l...

Bibliographic Details
Main Authors: Zeinab Mahmoud, Chunlin Li, Marco Zappatore, Aiman Solyman, Ali Alfatemi, Ashraf Osman Ibrahim Elsayed, Abdelzahir Abdelmaboud
Format: Article
Language: English
Published: PeerJ, Inc. 2023
Subjects: P1-85 General; QA75.5-76.95 Electronic computers. Computer science
Online Access: https://eprints.ums.edu.my/id/eprint/38438/1/ABSTRACT.pdf
https://eprints.ums.edu.my/id/eprint/38438/2/FULL%20TEXT.pdf
https://eprints.ums.edu.my/id/eprint/38438/
http://dx.doi.org/10.7717/peerj-cs.1639
Institution: Universiti Malaysia Sabah
id my.ums.eprints.38438
record_format eprints
spelling my.ums.eprints.38438 2024-03-05T02:35:15Z https://eprints.ums.edu.my/id/eprint/38438/ PeerJ, Inc. 2023 Article NonPeerReviewed Zeinab Mahmoud and Chunlin Li and Marco Zappatore and Aiman Solyman and Ali Alfatemi and Ashraf Osman Ibrahim Elsayed and Abdelzahir Abdelmaboud (2023) Semi-supervised learning and bidirectional decoding for effective grammar correction in low-resource scenarios. PeerJ Computer Science. pp. 1-25. ISSN 2376-5992 http://dx.doi.org/10.7717/peerj-cs.1639
institution Universiti Malaysia Sabah
building UMS Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Malaysia Sabah
content_source UMS Institutional Repository
url_provider http://eprints.ums.edu.my/
language English
topic P1-85 General
QA75.5-76.95 Electronic computers. Computer science
description The correction of grammatical errors in natural language processing is a crucial task as it aims to enhance the accuracy and intelligibility of written language. However, developing a grammatical error correction (GEC) framework for low-resource languages presents significant challenges due to the lack of available training data. This article proposes a novel GEC framework for low-resource languages, using Arabic as a case study. To generate more training data, we propose a semi-supervised confusion method called the equal distribution of synthetic errors (EDSE), which generates a wide range of parallel training data. Additionally, this article addresses two limitations of the classical seq2seq GEC model, which are unbalanced outputs due to the unidirectional decoder and exposure bias during inference. To overcome these limitations, we apply a knowledge distillation technique from neural machine translation. This method utilizes two decoders, a forward decoder right-to-left and a backward decoder left-to-right, and measures their agreement using Kullback-Leibler divergence as a regularization term. The experimental results on two benchmarks demonstrate that our proposed framework outperforms the Transformer baseline and two widely used bidirectional decoding techniques, namely asynchronous and synchronous bidirectional decoding. Furthermore, the proposed framework reported the highest F1 score, and generating synthetic data using the equal distribution technique for syntactic errors resulted in a significant improvement in performance. These findings demonstrate the effectiveness of the proposed framework for improving grammatical error correction for low-resource languages, particularly for the Arabic language.
format Article
author Zeinab Mahmoud
Chunlin Li
Marco Zappatore
Aiman Solyman
Ali Alfatemi
Ashraf Osman Ibrahim Elsayed
Abdelzahir Abdelmaboud
author_sort Zeinab Mahmoud
title Semi-supervised learning and bidirectional decoding for effective grammar correction in low-resource scenarios
title_sort semi-supervised learning and bidirectional decoding for effective grammar correction in low-resource scenarios
publisher PeerJ, Inc.
publishDate 2023
url https://eprints.ums.edu.my/id/eprint/38438/1/ABSTRACT.pdf
https://eprints.ums.edu.my/id/eprint/38438/2/FULL%20TEXT.pdf
https://eprints.ums.edu.my/id/eprint/38438/
http://dx.doi.org/10.7717/peerj-cs.1639
_version_ 1793154684239740928
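
Illustrative note: the abstract describes measuring the agreement of two decoders with Kullback-Leibler divergence and using it as a regularization term. The following is a minimal sketch of that general idea only, not the authors' implementation. The function names, the symmetric form of the divergence, the `weight` parameter, and the assumption that both decoders' outputs have already been aligned to the same token order are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Turn one position's raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def agreement_loss(first_logits, second_logits):
    """Symmetric KL between two decoders, averaged over token positions.

    Assumption: both inputs are lists of per-position logit vectors that
    refer to the same token positions (one decoder's outputs re-ordered
    if it generated in the opposite direction).
    """
    total = 0.0
    for a, b in zip(first_logits, second_logits):
        p, q = softmax(a), softmax(b)
        total += 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return total / len(first_logits)


def total_loss(nll_first, nll_second, first_logits, second_logits, weight=0.5):
    """Both decoders' likelihood losses plus the weighted agreement term.

    Assumption: `weight` is a hypothetical hyperparameter, not a value
    taken from the article.
    """
    return nll_first + nll_second + weight * agreement_loss(first_logits, second_logits)
```

When the two decoders assign identical distributions at every position, the agreement term is zero and only the likelihood losses remain; the more the distributions differ, the larger the penalty added to training.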