Selecting training samples from large and noisy corpora for efficient text classification

59 p.

Bibliographic Details
Main Author: Wong, Daji
Other Authors: Manoranjan Dash
Format: Theses and Dissertations
Published: 2011
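Subjects: DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing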
Online Access: http://hdl.handle.net/10356/47535
Institution: Nanyang Technological University
id sg-ntu-dr.10356-47535
record_format dspace
spelling sg-ntu-dr.10356-47535 2019-12-10T13:02:26Z
Selecting training samples from large and noisy corpora for efficient text classification
Wong, Daji
Manoranjan Dash
Wee Kim Wee School of Communication and Information
DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing
59 p.
This thesis presents an algorithm that selects sample documents for training text classifiers. The number of documents is often very large, and the documents are often noisy. For both efficiency and accuracy, one needs good samples rather than blind samples such as those produced by simple random sampling. The proposed algorithm is far superior to simple random sampling, both at small sampling ratios and in the presence of noise. It is based on a simple principle: each term should have approximately the same document frequency in the set of training sample documents as in the whole set (not including the test set).
Master of Science (Information Studies)
2011-12-27T08:36:21Z
2009
Thesis
http://hdl.handle.net/10356/47535
Nanyang Technological University
application/pdf
institution Nanyang Technological University
building NTU Library
country Singapore
collection DR-NTU
topic DRNTU::Engineering::Computer science and engineering::Computing methodologies::Document and text processing
description 59 p.
author2 Manoranjan Dash
author_facet Manoranjan Dash
Wong, Daji
format Theses and Dissertations
author Wong, Daji
author_sort Wong, Daji
title Selecting training samples from large and noisy corpora for efficient text classification
publishDate 2011
url http://hdl.handle.net/10356/47535
_version_ 1681049408972521472
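
The abstract states only the guiding principle: a term's document frequency in the training sample should be close to its document frequency in the whole training set. The record does not describe the thesis's actual procedure, so the following is a minimal illustrative sketch and not the author's algorithm. It assumes that "document frequency" is compared as a fraction of documents, that closeness is measured by a sum of squared differences, and that documents are picked greedily. The names doc_freq, df_distance, and select_sample and the toy pool are hypothetical.

```python
from collections import Counter


def doc_freq(docs):
    """Relative document frequency: fraction of docs that contain each term."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc))  # count each term at most once per document
    n = len(docs)
    return {term: c / n for term, c in counts.items()}


def df_distance(sample_df, pool_df):
    """Sum of squared differences between two document-frequency profiles (assumed measure)."""
    terms = set(sample_df) | set(pool_df)
    return sum((sample_df.get(t, 0.0) - pool_df.get(t, 0.0)) ** 2 for t in terms)


def select_sample(pool, k):
    """Greedily pick k document indices whose term frequencies track the pool's.

    Assumption: a greedy search stands in for whatever search the thesis uses.
    """
    if not 0 < k <= len(pool):
        raise ValueError("k must be between 1 and len(pool)")
    pool_df = doc_freq(pool)
    chosen = []
    remaining = list(range(len(pool)))
    for _ in range(k):
        # Add the document that brings the sample profile closest to the pool profile.
        best = min(
            remaining,
            key=lambda i: df_distance(
                doc_freq([pool[j] for j in chosen + [i]]), pool_df
            ),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy training pool: each document is a list of tokens (test set excluded).
pool = [
    ["cheap", "loan", "offer"],
    ["loan", "rate", "bank"],
    ["match", "goal", "team"],
    ["team", "coach", "goal"],
    ["bank", "offer", "rate"],
    ["goal", "match", "coach"],
]
sample_idx = select_sample(pool, 2)
sample_docs = [pool[i] for i in sample_idx]
```

The greedy loop recomputes the sample profile for every candidate, which is quadratic in the pool size. That is acceptable for a toy illustration, but a large corpus of the kind the abstract describes would need incremental counts or a different search strategy.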