Developing a new statistical method for Chinese text segmentation

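A new statistical formula for Chinese text segmentation, called the Contextual Information Formula (CIF), was developed empirically for identifying 2- and 3-character words. It was developed by performing stepwise logistic regression on a sample of manually segmented sentences. 300 sentences were used for model building, and 100 sentences were set aside for model validation and evaluation. Relative frequencies, document frequencies, weighted document frequencies, and within-document frequencies of characters, bigrams, and trigrams were included in the study.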
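
The abstract names relative frequency and document frequency of characters, bigrams and trigrams among the candidate predictors. As a rough illustration only, the sketch below counts two of these for character n-grams in a toy corpus. The function names, the toy documents, and the exact definitions (n-gram count over total n-grams, and share of documents containing the n-gram) are assumptions made for this example; the record does not give the thesis's own definitions or the fitted formula.

```python
from collections import Counter


def ngrams(text, n):
    """Return all overlapping character n-grams of length n in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def frequency_features(documents, n):
    """Compute two illustrative frequency features for character n-grams.

    documents: list of unsegmented strings, one per document.
    Returns (rel_freq, doc_freq), two dicts keyed by n-gram.
    These definitions are assumptions, not the thesis's own.
    """
    total_counts = Counter()  # occurrences of each n-gram across the corpus
    doc_counts = Counter()    # number of documents containing each n-gram
    for doc in documents:
        grams = ngrams(doc, n)
        total_counts.update(grams)
        doc_counts.update(set(grams))
    total = sum(total_counts.values())
    rel_freq = {g: c / total for g, c in total_counts.items()}
    doc_freq = {g: c / len(documents) for g, c in doc_counts.items()}
    return rel_freq, doc_freq


# Hypothetical toy corpus of three short "documents".
docs = ["中文分词", "中文文本", "文本分词"]
rel_freq, doc_freq = frequency_features(docs, 2)
```

Features of this kind, computed for each candidate bigram or trigram, could then serve as inputs to a logistic regression that predicts whether the candidate is a word.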

Bibliographic Details
Main Author: Dai, Yubin
Other Authors: Khoo, Christopher Soo Guan
Format: Theses and Dissertations
Published: 2008
Online Access: http://hdl.handle.net/10356/2614
Institution: Nanyang Technological University
id sg-ntu-dr.10356-2614
record_format dspace
spelling sg-ntu-dr.10356-2614 2023-03-04T00:38:07Z Developing a new statistical method for Chinese text segmentation Dai, Yubin Khoo, Christopher Soo Guan School of Computer Engineering DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition DRNTU::Engineering::Computer science and engineering::Theory of computation::Analysis of algorithms and problem complexity A new statistical formula for Chinese text segmentation, called the Contextual Information Formula (CIF), was developed empirically for identifying 2- and 3-character words. It was developed by performing stepwise logistic regression on a sample of manually segmented sentences. 300 sentences were used for model building, and 100 sentences were set aside for model validation and evaluation. Relative frequencies, document frequencies, weighted document frequencies, and within-document frequencies of characters, bigrams, and trigrams were included in the study. Master of Applied Science 2008-09-17T09:06:16Z 2008-09-17T09:06:16Z 1999 1999 Thesis http://hdl.handle.net/10356/2614 Nanyang Technological University application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
topic DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
DRNTU::Engineering::Computer science and engineering::Theory of computation::Analysis of algorithms and problem complexity
spellingShingle DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition
DRNTU::Engineering::Computer science and engineering::Theory of computation::Analysis of algorithms and problem complexity
Dai, Yubin
Developing a new statistical method for Chinese text segmentation
description A new statistical formula for Chinese text segmentation, called the Contextual Information Formula (CIF), was developed empirically for identifying 2- and 3-character words. It was developed by performing stepwise logistic regression on a sample of manually segmented sentences. 300 sentences were used for model building, and 100 sentences were set aside for model validation and evaluation. Relative frequencies, document frequencies, weighted document frequencies, and within-document frequencies of characters, bigrams, and trigrams were included in the study.
author2 Khoo, Christopher Soo Guan
author_facet Khoo, Christopher Soo Guan
Dai, Yubin
format Theses and Dissertations
author Dai, Yubin
author_sort Dai, Yubin
title Developing a new statistical method for Chinese text segmentation
title_short Developing a new statistical method for Chinese text segmentation
title_full Developing a new statistical method for Chinese text segmentation
title_fullStr Developing a new statistical method for Chinese text segmentation
title_full_unstemmed Developing a new statistical method for Chinese text segmentation
title_sort developing a new statistical method for chinese text segmentation
publishDate 2008
url http://hdl.handle.net/10356/2614
_version_ 1759855961030262784