Challenging issues in classification problems: sparsity control, key instance detection, and imbalanced data



Bibliographic Details
Main Author: Liu, Guoqing.
Other Authors: School of Computer Engineering
Format: Theses and Dissertations
Language: English
Published: 2013
Subjects:
Online Access: http://hdl.handle.net/10356/52422
Physical Description
Summary: This thesis deals with the difficulties in classification problems caused by three types of sparsity characteristics: feature, label, and instance sparsity. First, feature sparsity is usually used as prior knowledge by inducing parameter sparsity in the learned model. We show that only an appropriate degree of parameter sparsity is beneficial, and that both over-sparsity and under-sparsity are harmful to classification. Second, label sparsity means that only a fraction of the training instances are labeled, which causes classic classification methods to fail. Third, instance sparsity is caused by an imbalanced composition of categories, where instances from one category significantly outnumber those from the other; this biases the classification boundary towards the majority category. Consequently, three contributions are presented to address these challenges: sparsity control, key instance detection, and imbalanced classification. Sparsity control aims to regularize the sparsity of the model parameters at an appropriate level according to the intrinsic feature sparsity in the data. It is motivated by the observation that sparsity is not always desirable in real problems, and that only a proper degree of sparsity is beneficial. To address this issue, we propose a novel probit classifier using generalized Gaussian scale mixture (GGSM) priors that can adjust the induced sparsity by tuning the shape parameter of the GGSM, and consequently provide either a sparse or non-sparse solution depending on the intrinsic feature sparsity. Model learning is carried out by an efficient modified maximum a posteriori estimation. We show relationships of the proposed approach to previous methods. We also study different types of likelihood working with GGSM priors in a kernel-based setup, based on which an improved kernel-based approach is presented.
Experiments demonstrate that the proposed method has better or comparable performance in both linear and non-linear classification.
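The core idea behind the sparsity-control contribution can be sketched with a standard MAP argument: a generalized Gaussian prior on a weight vector has log-density proportional to -|w/b|^p, so MAP estimation adds an L_p penalty whose shape parameter p tunes the induced sparsity (p = 1 gives a sparse, Laplace-like solution; p = 2 a non-sparse, Gaussian-like one). The following is a minimal illustration of that effect only, not the thesis's probit/GGSM method; the data, the `fit_logistic_lp` helper, and all hyperparameters are hypothetical, and logistic regression with subgradient descent stands in for the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: only the first 3 of 20 features are informative.
n, d = 200, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)

def fit_logistic_lp(X, y, p, lam=0.1, lr=0.05, steps=3000, eps=1e-8):
    """Logistic regression with an L_p penalty (MAP under a
    generalized-Gaussian-style prior), fit by subgradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        z = 1.0 / (1.0 + np.exp(-(X @ w)))          # sigmoid predictions
        grad = X.T @ (z - y) / len(y)               # logistic-loss gradient
        # Subgradient of lam * sum_i |w_i|^p; eps guards the p < 2 case at 0.
        grad += lam * p * np.sign(w) * (np.abs(w) + eps) ** (p - 1)
        w -= lr * grad
    return w

w_l1 = fit_logistic_lp(X, y, p=1.0)  # sparse regime (Laplace-like prior)
w_l2 = fit_logistic_lp(X, y, p=2.0)  # non-sparse regime (Gaussian prior)

# The L1-regularized fit drives the 17 irrelevant weights much closer to zero.
print("mean |w| on irrelevant features, p=1:", np.mean(np.abs(w_l1[3:])))
print("mean |w| on irrelevant features, p=2:", np.mean(np.abs(w_l2[3:])))
```

Varying p between 1 and 2 interpolates between the two regimes, which mirrors the abstract's point that the shape parameter lets one match the induced parameter sparsity to the intrinsic feature sparsity of the data.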