Development of Thai word segmentation technique for solving problems with unknown words

© 2015 IEEE. This research aims to develop an efficient technique for Thai word segmentation, especially for words that do not exist in dictionaries. The researchers developed a Thai word segmentation model that relies on grammar and rules to handle words not found in dictionar...

Bibliographic Details
Main Authors: Chanin Mahatthanachai, Kanchit Malaivongs, Nuttiya Tantranont, Ekkarat Boonchieng
Format: Conference Proceeding
Published: 2018
Subjects: Computer Science, Decision Sciences
Online Access: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=84964341233&origin=inward
http://cmuir.cmu.ac.th/jspui/handle/6653943832/55537
Institution: Chiang Mai University
id th-cmuir.6653943832-55537
record_format dspace
spelling th-cmuir.6653943832-55537 2018-09-05T02:58:36Z 2018-09-05T02:57:39Z 2018-09-05T02:57:39Z 2016-02-08 Conference Proceeding 2-s2.0-84964341233 10.1109/ICSEC.2015.7401423
institution Chiang Mai University
building Chiang Mai University Library
country Thailand
collection CMU Intellectual Repository
topic Computer Science
Decision Sciences
description © 2015 IEEE. This research aims to develop an efficient technique for Thai word segmentation, especially for words that do not exist in dictionaries. The researchers developed a Thai word segmentation model that relies on grammar and rules to handle words not found in dictionaries. The model applies a segmentation technique developed by the researchers, called PTTSF (Parsing Thai Text with Syntax and Feature of Word), and is intended to serve as the best approach to word segmentation. The system first finds the boundary of each word in a Thai sentence. If it encounters a word that is not in the dictionary, or a meaningless word, the longest-matching algorithm alone cannot resolve it, so rules must be specified for such cases. In this study, 28 rules were created, and the Digraph method was used to find the segmentation pattern with the highest probability according to grammatical principles. Once word boundaries have been found, the correctly segmented words can be used in further processing. The system's efficiency was evaluated mainly by its word segmentation accuracy. The results show that the derived mapping technique can segment words that do not exist in the dictionary with an average accuracy of over 90% across the whole document. However, the researchers encountered a problem with ambiguous words. Although this problem rarely occurs, it can affect segmentation accuracy.
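Illustrative note: the description refers to the longest-matching algorithm and its difficulty with words missing from the dictionary. The following is a minimal sketch of generic greedy longest matching, not the PTTSF technique, its 28 rules, or the Digraph method, none of which are specified in this record. The function name `longest_match_segment`, the `(token, known)` return shape, and the toy dictionary are assumptions made for illustration only.

```python
def longest_match_segment(text, dictionary):
    """Greedy left-to-right longest matching (generic sketch, not PTTSF).

    Returns a list of (token, known) pairs. Characters that do not start
    any dictionary word are grouped into one chunk flagged known=False.
    """
    max_len = max((len(w) for w in dictionary), default=0)
    tokens = []
    unknown = []  # buffer of characters not covered by a dictionary word
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            # No dictionary word starts here: treat the character as unknown.
            unknown.append(text[i])
            i += 1
            continue
        if unknown:
            # Flush the pending unknown chunk before the matched word.
            tokens.append(("".join(unknown), False))
            unknown = []
        tokens.append((match, True))
        i += len(match)
    if unknown:
        tokens.append(("".join(unknown), False))
    return tokens


if __name__ == "__main__":
    # Hypothetical toy dictionary; a real system would load a full lexicon.
    toy_dictionary = {"ไป", "โรง", "โรงเรียน"}
    print(longest_match_segment("ไปโรงเรียน", toy_dictionary))
```

The sketch can only flag unmatched spans as unknown; it cannot decide where an unknown word really begins or ends, which is the gap the description says the rule-based approach is meant to address.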
format Conference Proceeding
author Chanin Mahatthanachai
Kanchit Malaivongs
Nuttiya Tantranont
Ekkarat Boonchieng
title Development of Thai word segmentation technique for solving problems with unknown words
publishDate 2018
url https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=84964341233&origin=inward
http://cmuir.cmu.ac.th/jspui/handle/6653943832/55537