TranSentCut-transformer based Thai sentence segmentation

We propose TranSentCut, a sentence segmentation model for Thai based on the transformer architecture. Sentence segmentation is difficult for Thai because, unlike many other languages, Thai has no end-of-sentence marker. Existing methods rely on POS tags, which are not easy to label and must be done for...

Bibliographic Details
Main Author: Yuenyong S.
Other Authors: Mahidol University
Format: Article
Published: 2023
Subjects: Multidisciplinary
Online Access: https://repository.li.mahidol.ac.th/handle/123456789/87650
Institution: Mahidol University
id th-mahidol.87650
record_format dspace
spelling th-mahidol.87650 2023-06-27T01:13:00Z TranSentCut-transformer based Thai sentence segmentation Yuenyong S. Mahidol University Multidisciplinary 2023-06-26T18:12:59Z 2023-06-26T18:12:59Z 2022-05-01 Article Songklanakarin Journal of Science and Technology Vol.44 No.3 (2022), 852-860 01253395 2-s2.0-85137518799 https://repository.li.mahidol.ac.th/handle/123456789/87650 SCOPUS
institution Mahidol University
building Mahidol University Library
continent Asia
country Thailand
content_provider Mahidol University Library
collection Mahidol University Institutional Repository
topic Multidisciplinary
description We propose TranSentCut, a sentence segmentation model for Thai based on the transformer architecture. Sentence segmentation is difficult for Thai because, unlike many other languages, Thai has no end-of-sentence marker. Existing methods rely on POS tags, which are not easy to label and must be provided for every word in the data. This limits the applicability and performance of sentence segmentation on open-domain text, because the only high-quality Thai corpus that has both sentence boundary and POS labels was constructed mostly from academic articles. Our approach uses only raw text for training, and the only labelling required is to place each sentence on its own line in a text file. This makes new datasets much easier to construct. Comparison with existing methods shows that our proposed model is competitive with the most recent state of the art when evaluated on in-domain text, and improves significantly over existing publicly available libraries when applied to out-of-domain input text.
author2 Mahidol University
format Article
author Yuenyong S.
title TranSentCut-transformer based Thai sentence segmentation
publishDate 2023
url https://repository.li.mahidol.ac.th/handle/123456789/87650
_version_ 1781414398800166912
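
The description says that the only labelling the approach needs is a text file with one sentence per line. The sketch below illustrates how such a file already encodes sentence boundaries: joining the lines back into running text gives the character offset at which each sentence ends. This is a minimal illustration, not the TranSentCut training procedure. The function names, the single-space joiner, and the sample input are all hypothetical, and the record does not say how the model consumes these labels.

```python
from typing import Iterable, List, Tuple


def boundaries_from_lines(lines: Iterable[str], joiner: str = " ") -> Tuple[str, List[int]]:
    """Join one-sentence-per-line text and record where each sentence ends.

    Assumption (not from the source): sentences are joined with a single
    space when the running text is rebuilt.
    """
    # Drop blank lines and surrounding whitespace; each remaining line is one sentence.
    sentences = [ln.strip() for ln in lines if ln.strip()]
    boundaries: List[int] = []
    pos = 0
    for i, sentence in enumerate(sentences):
        if i > 0:
            pos += len(joiner)      # account for the joiner before this sentence
        pos += len(sentence)
        boundaries.append(pos)      # exclusive end offset of this sentence
    return joiner.join(sentences), boundaries


def split_at(text: str, boundaries: List[int], joiner: str = " ") -> List[str]:
    """Recover the sentences from running text and its end offsets."""
    sentences = []
    start = 0
    for end in boundaries:
        sentences.append(text[start:end])
        start = end + len(joiner)   # skip the joiner that follows the boundary
    return sentences


if __name__ == "__main__":
    # Hypothetical sample standing in for a one-sentence-per-line file.
    sample = ["ab", "", "cde  "]
    text, ends = boundaries_from_lines(sample)
    print(text, ends)
    print(split_at(text, ends))
```

In this form, building a new dataset only requires putting line breaks at sentence boundaries. No per-word annotation such as POS tags is involved.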