Developing a new statistical method for Chinese text segmentation

Bibliographic Details
Main Author:	Dai, Yubin
Other Authors:	Khoo, Christopher Soo Guan
Format:	Theses and Dissertations
Published:	2008
Subjects:	DRNTU::Engineering::Computer science and engineering::Computing methodologies::Pattern recognition; DRNTU::Engineering::Computer science and engineering::Theory of computation::Analysis of algorithms and problem complexity
Online Access:	http://hdl.handle.net/10356/2614
Institution:	Nanyang Technological University

Description
Summary:	A new statistical formula for Chinese text segmentation, called the Contextual Information Formula (CIF), was developed empirically for identifying 2- and 3-character words. It was developed by performing stepwise logistic regression on a sample of sentences that had been manually segmented. Three hundred sentences were used for model building, and 100 sentences were set aside for model validation and evaluation. The candidate predictors included in the study were the relative frequencies, document frequencies, weighted document frequencies and within-document frequencies of characters, bigrams and trigrams.
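
As an illustration of the kind of inputs the summary describes, the following minimal Python sketch computes two of the named statistics (relative frequency and document frequency) for character bigrams and passes them through a logistic function. It is not the formula from the thesis: the toy corpus, the function names and the coefficient values are all hypothetical placeholders, and the stepwise variable selection is not shown.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def frequency_features(docs, n):
    """Relative frequency and document frequency for each character n-gram."""
    counts = Counter()      # total occurrences across the corpus
    doc_counts = Counter()  # number of documents containing the n-gram
    for doc in docs:
        grams = ngrams(doc, n)
        counts.update(grams)
        doc_counts.update(set(grams))
    total = sum(counts.values())
    return {
        g: {
            "rel_freq": counts[g] / total,
            "doc_freq": doc_counts[g] / len(docs),
        }
        for g in counts
    }


def word_probability(features, weights, intercept):
    """Logistic model: P = 1 / (1 + exp(-(b0 + sum of b_i * x_i)))."""
    z = intercept + sum(weights[name] * features[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-z))


# Toy corpus (hypothetical); each string stands for one document.
docs = ["中文分词", "中文文本", "文本分词方法"]
bigram_features = frequency_features(docs, 2)

# Placeholder coefficients, NOT fitted values from the thesis.
weights = {"rel_freq": 4.0, "doc_freq": 2.0}
p = word_probability(bigram_features["中文"], weights, intercept=-2.0)
```

In a real study the coefficients would be estimated from manually segmented sentences, and a stepwise procedure would decide which of the candidate frequency measures remain in the model.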


Similar Items