Efficient text classification


Bibliographic Details
Main Author: Tan, Cheryl Qian Ru.
Other Authors: Manoranjan Dash
Format: Final Year Project
Language: English
Published: 2010
Online Access: http://hdl.handle.net/10356/39727
Institution: Nanyang Technological University
Summary: As the digital age advances, the volume of data and the size of documents are growing rapidly, so a more efficient and accurate method of sampling data for training text classifiers is required. Training needs good samples rather than the blind samples produced by Simple Random Sampling (SRS), so we experimented with a newly proposed sampling algorithm, CONCISE. CONCISE is a novel algorithm for selecting training documents for text classification, and experiments showed that it works particularly well at small sampling ratios. Experiments were conducted on the 20 Newsgroups corpus and the Reuters-21578 document set using two classifiers, SVM and Naïve Bayes. CONCISE was compared with SRS in every experiment: it achieved higher accuracy than SRS at all sampling ratios, and its accuracy was consistent regardless of the classifier used. CONCISE requires more running time than SRS, but the added cost is small relative to the gain in accuracy.