Thai Named Entity Recognition Using BiLSTM-CNN-CRF Enhanced by TCC

Languages spoken in Asia share common morphological analysis errors in word segmentation, which normally propagate to higher-level processing such as part-of-speech (POS) tagging, syntactic parsing, word extraction, and named entity recognition (NER), the task discussed in this research. We introduce t...

Bibliographic Details
Main Author:	Sornlertlamvanich V.
Other Authors:	Mahidol University
Format:	Article
Published:	2023
Subjects:	Computer Science
Online Access:	https://repository.li.mahidol.ac.th/handle/123456789/84392
Institution:	Mahidol University

id	th-mahidol.84392
record_format	dspace
spelling	th-mahidol.84392 2023-06-19T00:03:56Z Thai Named Entity Recognition Using BiLSTM-CNN-CRF Enhanced by TCC Sornlertlamvanich V. Mahidol University Computer Science 2023-06-18T17:03:56Z 2023-06-18T17:03:56Z 2022-01-01 Article IEEE Access Vol.10 (2022), 53043-53052 10.1109/ACCESS.2022.3175201 21693536 2-s2.0-85130489064 https://repository.li.mahidol.ac.th/handle/123456789/84392 SCOPUS
institution	Mahidol University
building	Mahidol University Library
continent	Asia
country	Thailand
content_provider	Mahidol University Library
collection	Mahidol University Institutional Repository
topic	Computer Science
spellingShingle	Computer Science Sornlertlamvanich V. Thai Named Entity Recognition Using BiLSTM-CNN-CRF Enhanced by TCC
description	Languages spoken in Asia share common morphological analysis errors in word segmentation, which normally propagate to higher-level processing such as part-of-speech (POS) tagging, syntactic parsing, word extraction, and named entity recognition (NER), the task discussed in this research. We introduce the Thai character cluster (TCC) to reduce the errors propagated from word segmentation and POS tagging by incorporating it into the character representation layer of a bidirectional long short-term memory (BiLSTM) network for NER. The initial NER model is created from the original THAI-NEST named-entity (NE) tagged corpus by applying the best-performing BiLSTM-CNN-CRF model (a combination of BiLSTM, convolutional neural network (CNN), and conditional random field (CRF)) with word, POS, and TCC embeddings. We identify errors and improve the consistency of the NE annotation through our holdout method, retraining the model with the corrected training set. After the iteration, the overall annotation F1-score reaches 89.22%, an improvement of 16.21% over the model trained on the original corpus. Our iterative verification is a promising method for low-resource language modeling. As a result, a new NE silver standard corpus for the Thai NER task, called the Bangkok Data NE tagged Corpus (BKD), is generated. The consistency of annotation is checked and revised according to the improved scope of NE detection provided by TCC, which can recover from errors in word segmentation.
author2	Mahidol University
author_facet	Mahidol University Sornlertlamvanich V.
format	Article
author	Sornlertlamvanich V.
author_sort	Sornlertlamvanich V.
title	Thai Named Entity Recognition Using BiLSTM-CNN-CRF Enhanced by TCC
publishDate	2023
url	https://repository.li.mahidol.ac.th/handle/123456789/84392
