FlaCGEC: A Chinese grammatical error correction dataset with fine-grained linguistic annotation
Chinese Grammatical Error Correction (CGEC) has been attracting growing attention from researchers recently. In spite of the fact that multiple CGEC datasets have been developed to support the research, these datasets lack the ability to provide a deep linguistic topology of grammar errors, which is...
Main Authors: | DU, Hanyue; ZHAO, Yike; TIAN, Qingyuan; WANG, Jiani; WANG, Lei; LAN, Yunshi; LU, Xuesong |
---|---|
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2023 |
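Subjects: | Chinese Grammatical Error Correction; Deep Learning; Fine-grained Linguistic Annotation; Asian Studies; Databases and Information Systems; East Asian Languages and Societies |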
Online Access: | https://ink.library.smu.edu.sg/sis_research/8463 https://ink.library.smu.edu.sg/context/sis_research/article/9466/viewcontent/FlaCGEC_av.pdf |
Institution: | Singapore Management University |
id |
sg-smu-ink.sis_research-9466 |
---|---|
record_format |
dspace |
spelling |
sg-smu-ink.sis_research-9466; 2024-01-04T09:42:43Z; FlaCGEC: A Chinese grammatical error correction dataset with fine-grained linguistic annotation; DU, Hanyue; ZHAO, Yike; TIAN, Qingyuan; WANG, Jiani; WANG, Lei; LAN, Yunshi; LU, Xuesong; Chinese Grammatical Error Correction (CGEC) has been attracting growing attention from researchers recently. In spite of the fact that multiple CGEC datasets have been developed to support the research, these datasets lack the ability to provide a deep linguistic topology of grammar errors, which is critical for interpreting and diagnosing CGEC approaches. To address this limitation, we introduce FlaCGEC, which is a new CGEC dataset featured with fine-grained linguistic annotation. Specifically, we collect raw corpus from the linguistic schema defined by Chinese language experts, conduct edits on sentences via rules, and refine generated samples manually, which results in 10k sentences with 78 instantiated grammar points and 3 types of edits. We evaluate various cutting-edge CGEC methods on the proposed FlaCGEC dataset and their unremarkable results indicate that this dataset is challenging in covering a large range of grammatical errors. In addition, we also treat FlaCGEC as a diagnostic dataset for testing generalization skills and conduct a thorough evaluation of existing CGEC models.; 2023-10-01T07:00:00Z; text; application/pdf; https://ink.library.smu.edu.sg/sis_research/8463; info:doi/10.1145/3583780.3615119; https://ink.library.smu.edu.sg/context/sis_research/article/9466/viewcontent/FlaCGEC_av.pdf; http://creativecommons.org/licenses/by-nc-nd/4.0/; Research Collection School Of Computing and Information Systems; eng; Institutional Knowledge at Singapore Management University; Chinese Grammatical Error Correction; Deep Learning; Fine-grained Linguistic Annotation; Asian Studies; Databases and Information Systems; East Asian Languages and Societies |
institution |
Singapore Management University |
building |
SMU Libraries |
continent |
Asia |
country |
Singapore |
content_provider |
SMU Libraries |
collection |
InK@SMU |
language |
English |
topic |
Chinese Grammatical Error Correction; Deep Learning; Fine-grained Linguistic Annotation; Asian Studies; Databases and Information Systems; East Asian Languages and Societies |
description |
Chinese Grammatical Error Correction (CGEC) has been attracting growing attention from researchers recently. In spite of the fact that multiple CGEC datasets have been developed to support the research, these datasets lack the ability to provide a deep linguistic topology of grammar errors, which is critical for interpreting and diagnosing CGEC approaches. To address this limitation, we introduce FlaCGEC, which is a new CGEC dataset featured with fine-grained linguistic annotation. Specifically, we collect raw corpus from the linguistic schema defined by Chinese language experts, conduct edits on sentences via rules, and refine generated samples manually, which results in 10k sentences with 78 instantiated grammar points and 3 types of edits. We evaluate various cutting-edge CGEC methods on the proposed FlaCGEC dataset and their unremarkable results indicate that this dataset is challenging in covering a large range of grammatical errors. In addition, we also treat FlaCGEC as a diagnostic dataset for testing generalization skills and conduct a thorough evaluation of existing CGEC models. |
format |
text |
author |
DU, Hanyue; ZHAO, Yike; TIAN, Qingyuan; WANG, Jiani; WANG, Lei; LAN, Yunshi; LU, Xuesong |
author_sort |
DU, Hanyue |
title |
FlaCGEC: A Chinese grammatical error correction dataset with fine-grained linguistic annotation |
publisher |
Institutional Knowledge at Singapore Management University |
publishDate |
2023 |
url |
https://ink.library.smu.edu.sg/sis_research/8463 https://ink.library.smu.edu.sg/context/sis_research/article/9466/viewcontent/FlaCGEC_av.pdf |