Developing a new statistical method for Chinese text segmentation
A new statistical formula for Chinese text segmentation, called the Contextual Information Formula (CIF), was developed empirically for identifying 2- and 3-character words. It was derived by performing stepwise logistic regression on a sample of manually segmented sentences. 300 sentence...
Main Author: | Dai, Yubin |
---|---|
Other Authors: | Khoo, Christopher Soo Guan |
Format: | Theses and Dissertations |
Published: | 2008 |
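Subjects: | DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition; DRNTU::Engineering::Computer science and engineering::Theory of computation::Analysis of algorithms and problem complexity |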
Online Access: | http://hdl.handle.net/10356/2614 |
Institution: | Nanyang Technological University |
id | sg-ntu-dr.10356-2614 |
---|---|
record_format | dspace |
spelling | sg-ntu-dr.10356-2614 2023-03-04T00:38:07Z Developing a new statistical method for Chinese text segmentation Dai, Yubin Khoo, Christopher Soo Guan School of Computer Engineering DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition DRNTU::Engineering::Computer science and engineering::Theory of computation::Analysis of algorithms and problem complexity A new statistical formula for Chinese text segmentation, called the Contextual Information Formula (CIF), was developed empirically for identifying 2- and 3-character words. It was derived by performing stepwise logistic regression on a sample of manually segmented sentences. 300 sentences were used for model building, and 100 sentences were set aside for model validation and evaluation. The candidate predictors included relative frequencies, document frequencies, weighted document frequencies, and within-document frequencies of characters, bigrams and trigrams. Master of Applied Science 2008-09-17T09:06:16Z 2008-09-17T09:06:16Z 1999 1999 Thesis http://hdl.handle.net/10356/2614 Nanyang Technological University application/pdf |
institution | Nanyang Technological University |
building | NTU Library |
continent | Asia |
country | Singapore |
content_provider | NTU Library |
collection | DR-NTU |
topic | DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition; DRNTU::Engineering::Computer science and engineering::Theory of computation::Analysis of algorithms and problem complexity |
description | A new statistical formula for Chinese text segmentation, called the Contextual Information Formula (CIF), was developed empirically for identifying 2- and 3-character words. It was derived by performing stepwise logistic regression on a sample of manually segmented sentences. 300 sentences were used for model building, and 100 sentences were set aside for model validation and evaluation. The candidate predictors included relative frequencies, document frequencies, weighted document frequencies, and within-document frequencies of characters, bigrams and trigrams. |
author2 | Khoo, Christopher Soo Guan |
author_facet | Khoo, Christopher Soo Guan; Dai, Yubin |
format | Theses and Dissertations |
author | Dai, Yubin |
author_sort | Dai, Yubin |
title | Developing a new statistical method for Chinese text segmentation |
publishDate | 2008 |
url | http://hdl.handle.net/10356/2614 |
_version_ | 1759855961030262784 |
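
The description field in this record names relative frequencies and document frequencies of characters, bigrams and trigrams among the quantities studied. The record does not give the thesis's exact definitions, so the following is only a minimal sketch under assumed definitions: relative frequency is taken as an n-gram's count divided by the total number of n-grams in the corpus, and document frequency as the fraction of documents containing the n-gram. The function names `ngrams` and `frequency_features` and the toy corpus are hypothetical, and the stepwise logistic regression step is not shown.

```python
from collections import Counter


def ngrams(text, n):
    """Return all overlapping character n-grams of length n in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def frequency_features(documents, n):
    """Map each character n-gram to (relative_frequency, document_frequency).

    Assumed definitions (not taken from the thesis):
      relative_frequency = occurrences of the n-gram / all n-gram occurrences
      document_frequency = documents containing the n-gram / number of documents
    """
    total = Counter()     # occurrences across the whole corpus
    doc_freq = Counter()  # number of documents containing each n-gram
    for doc in documents:
        grams = ngrams(doc, n)
        total.update(grams)
        doc_freq.update(set(grams))  # count each n-gram once per document
    n_tokens = sum(total.values())
    return {
        gram: (count / n_tokens, doc_freq[gram] / len(documents))
        for gram, count in total.items()
    }


if __name__ == "__main__":
    # Hypothetical toy corpus; each string stands in for one document.
    corpus = ["中文分词", "中文信息", "分词方法"]
    for gram, (rel, df) in frequency_features(corpus, 2).items():
        print(gram, round(rel, 3), round(df, 3))
```

Values like these could serve as candidate predictors in a regression model that scores whether a bigram or trigram is a word; which predictors the thesis actually retained is not stated in this record.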