Enhanced subspace clustering

Subspace clustering overcomes the curse of dimensionality that traditional clustering suffered, by finding groups of objects that are homogeneous in subspaces of the data, instead of the full space. Research on basic subspace clustering over the past decade primary focuses on finding groups of objec...

Full description

Saved in:
Bibliographic Details
Main Author: Sim, Kelvin Sian Hui.
Other Authors: Vivekanand Gopalkrishnan
Format: Theses and Dissertations
Language:English
Published: 2012
Subjects:
Online Access:http://hdl.handle.net/10356/48647
Tags: Add Tag
No Tags, Be the first to tag this record!
Institution: Nanyang Technological University
Language: English
Description
Summary:Subspace clustering overcomes the curse of dimensionality that traditional clustering suffered, by finding groups of objects that are homogeneous in subspaces of the data, instead of the full space. Research on basic subspace clustering over the past decade primary focuses on finding groups of objects that are closed together in subspaces of 2D data. Recently, the proliferation of non-traditional data and the need for higher quality clustering results have shifted the research paradigm to enhanced subspace clustering, which focuses on problems that cannot be handled or solved effectively through basic subspace clustering. The problems of enhanced subspace clustering can be categorized into two main groups, handling non-traditional data and improving clustering results. We give a survey on the enhanced subspace clustering problems, desired properties that these problems sought in their solutions, and the existing solutions. We study three main problems of enhanced subspace clustering on 2D and 3D datasets: mining subspace clusters in noisy data, mining significant subspace clusters and mining semi-supervised subspace clusters. For mining subspace clusters in noisy data, we found several problems of existing approaches, such as mining incomplete and unstable results, lacking the ability to handle 3D data, and mining clusters that are non-maximal and that contain skewed noise. We propose subspace clusters that are maximal and do not contain skewed noise. We also develop algorithms which exploit the anti-monotone property of the clusters to efficiently mine the complete and stable set of results. We show the effectiveness of our solution in mining biologically significant protein clusters in protein-protein interaction data, which is notoriously noisy in nature. For mining significant subspace clusters, we formulate an information theory concept known as correlation information, to measure the significance of the subspace clusters. We propose mining subspace clusters with high correlation information, and we develop an algorithm which uses the concept of rarity to mine significant 3D subspace clusters in a parameter-insensitive way. We show the effectiveness of our solution in finding significant (1) groups of proteins in protein-protein interaction data, (2) clusters of words and documents in word-document data and (3) in classifying an insurance data, where significant clusters are used as rules of the classifier. For mining semi-supervised subspace clusters, we propose actionable subspace clusters, which are semi-supervised subspace clusters that allow incorporation of user's knowledge, and can suggest beneficial actions to the users. We develop algorithms that use augmented Lagrangian multiplier method coupled with frequent itemset mining algorithm to efficiently mine the actionable clusters in a parameter-insensitive way. We show the effectiveness of our solution in finding actionable groups of residues in protein structural data, which are potential binding sites for drug molecules. Lastly, we present a financial data mining application on value investing, and show that our proposed algorithms outperform a famous value investment strategy in 70% of the experiments.