TranSentCut-transformer based Thai sentence segmentation
We propose TranSentCut, a sentence segmentation model for Thai based on the transformer architecture. Sentence segmentation is difficult for Thai because, unlike many other languages, Thai has no end-of-sentence marker. Existing methods make use of POS tags, which are not easy to label and must be done for...
Main Author: | Yuenyong S. |
---|---|
Other Authors: | Mahidol University |
Format: | Article |
Published: | 2023 |
Subjects: | Multidisciplinary |
Online Access: | https://repository.li.mahidol.ac.th/handle/123456789/87650 |
Institution: | Mahidol University |
id |
th-mahidol.87650 |
---|---|
record_format |
dspace |
spelling |
th-mahidol.87650 2023-06-27T01:13:00Z TranSentCut-transformer based Thai sentence segmentation Yuenyong S. Mahidol University Multidisciplinary We propose TranSentCut, a sentence segmentation model for Thai based on the transformer architecture. Sentence segmentation is difficult for Thai because, unlike many other languages, Thai has no end-of-sentence marker. Existing methods make use of POS tags, which are not easy to label and must be labelled for every word in the data. This limits the applicability and performance of sentence segmentation on open-domain text, because the only high-quality Thai corpus that has sentence boundary and POS labels was constructed mostly from academic articles. Our approach uses only raw text for training, and the only labelling required is to put each sentence on its own line in a text file. This makes new datasets much easier to construct. Comparison with existing methods shows that our proposed model is competitive with the most recent state of the art when evaluated on in-domain texts, and improves significantly over existing publicly available libraries when applied to out-of-domain input texts. 2023-06-26T18:12:59Z 2023-06-26T18:12:59Z 2022-05-01 Article Songklanakarin Journal of Science and Technology Vol.44 No.3 (2022) , 852-860 01253395 2-s2.0-85137518799 https://repository.li.mahidol.ac.th/handle/123456789/87650 SCOPUS |
institution |
Mahidol University |
building |
Mahidol University Library |
continent |
Asia |
country |
Thailand |
content_provider |
Mahidol University Library |
collection |
Mahidol University Institutional Repository |
topic |
Multidisciplinary |
spellingShingle |
Multidisciplinary Yuenyong S. TranSentCut-transformer based Thai sentence segmentation |
description |
We propose TranSentCut, a sentence segmentation model for Thai based on the transformer architecture. Sentence segmentation is difficult for Thai because, unlike many other languages, Thai has no end-of-sentence marker. Existing methods make use of POS tags, which are not easy to label and must be labelled for every word in the data. This limits the applicability and performance of sentence segmentation on open-domain text, because the only high-quality Thai corpus that has sentence boundary and POS labels was constructed mostly from academic articles. Our approach uses only raw text for training, and the only labelling required is to put each sentence on its own line in a text file. This makes new datasets much easier to construct. Comparison with existing methods shows that our proposed model is competitive with the most recent state of the art when evaluated on in-domain texts, and improves significantly over existing publicly available libraries when applied to out-of-domain input texts. |
author2 |
Mahidol University |
author_facet |
Mahidol University Yuenyong S. |
format |
Article |
author |
Yuenyong S. |
author_sort |
Yuenyong S. |
title |
TranSentCut-transformer based Thai sentence segmentation |
title_sort |
transentcut-transformer based thai sentence segmentation |
publishDate |
2023 |
url |
https://repository.li.mahidol.ac.th/handle/123456789/87650 |
_version_ |
1781414398800166912 |
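
The abstract states that the only labelling required is to place each sentence on its own line in a text file. As an illustration of how such a file could be turned into boundary labels, here is a minimal sketch. It is not taken from the article: `label_spaces` is a hypothetical helper, and treating every space in the joined text as a candidate sentence boundary is an assumption made here for the example.

```python
from typing import List, Tuple


def label_spaces(sentences: List[str], sep: str = " ") -> Tuple[str, List[Tuple[int, int]]]:
    """Join one-sentence-per-line input and label every space.

    Hypothetical helper (not from the article). Returns the joined text and
    a list of (char_offset, label) pairs, where label is 1 for a space that
    joins two sentences and 0 for a space inside a sentence.
    """
    # Drop empty lines and surrounding whitespace.
    cleaned = [s.strip() for s in sentences if s.strip()]

    parts: List[str] = []
    labels: List[Tuple[int, int]] = []
    offset = 0  # position of the current sentence in the joined text

    for i, sent in enumerate(cleaned):
        # Spaces inside a sentence are negative examples.
        for j, ch in enumerate(sent):
            if ch == " ":
                labels.append((offset + j, 0))
        parts.append(sent)
        offset += len(sent)

        # The separator between two consecutive lines is a positive example.
        if i < len(cleaned) - 1:
            labels.append((offset, 1))
            parts.append(sep)
            offset += len(sep)

    return "".join(parts), labels


# Placeholder ASCII input; real input would be Thai text, one sentence per line.
text, labels = label_spaces(["ab cd", "ef"])
```

Under these assumptions, no POS tags or word-level annotation are needed: the line breaks in the file are the only supervision, which matches the abstract's point that new datasets become much easier to construct.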