Entropy Weighting K-Means for high-dimensional data analysis

Bibliographic Details
Main Author: Leonel Rahman
Other Authors: Chen Lihui
Format: Final Year Project
Language: English
Published: 2010
Subjects:
Online Access: http://hdl.handle.net/10356/39388
Institution: Nanyang Technological University
Description
Summary: Entropy Weighting K-Means (EWKM) clustering is a new k-means type algorithm for clustering high-dimensional objects in subspaces. For high-dimensional data, the k-means clustering process is extended to calculate a weight for each dimension in each cluster, and the weight values are used to identify the subsets of important dimensions that categorize different clusters. This is achieved by including the weight entropy in the objective function that is minimized in the k-means clustering process. An additional step is added to the k-means clustering process to compute the weights of all dimensions in each cluster automatically. Experiments on both synthetic and real data have shown that the new algorithm can generate better clustering results than other subspace clustering algorithms. In this project the new algorithm is implemented in Java; it is also scalable to large data sets [4]. However, L. Jing's paper uses Euclidean distance as the similarity measure between any two data points and tests the algorithm only on low-dimensional (2-D) data. In this project, the algorithm was first applied directly to the test data set; the original EWKM expressions were then revised by introducing a cosine similarity measure, which gives more accurate clustering results. Entropy, purity and NMI scores are calculated and used as quantitative evaluation measures of the experimental results. We analyze the parameters, further study the advantages of the revised algorithm, and compare its effectiveness with simple K-Means and the original EWKM.
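
The additional weight-computation step mentioned in the summary can be illustrated with a minimal Java sketch. It assumes the update rule commonly stated for EWKM with squared Euclidean deviations: the weight of dimension i in cluster l is proportional to exp(-D_li / gamma), where D_li is the within-cluster dispersion along that dimension and gamma > 0 is the entropy parameter. The class name, method name and array layout are hypothetical and are not taken from the project's Java code; the cosine-based revision is not reproduced here because this record does not give its expressions.

```java
import java.util.Arrays;

/** Illustrative sketch only: per-cluster dimension weights from within-cluster dispersion. */
final class EwkmWeights {

    /**
     * Returns weights[l][i] for cluster l and dimension i; each row sums to 1.
     * Assumed rule: weight proportional to exp(-dispersion / gamma), with gamma > 0.
     */
    static double[][] updateWeights(double[][] points, int[] assignment,
                                    double[][] centers, double gamma) {
        int k = centers.length;
        int dims = centers[0].length;

        // dispersion[l][i]: sum of squared deviations from the center of
        // cluster l along dimension i, over the points assigned to cluster l.
        double[][] dispersion = new double[k][dims];
        for (int j = 0; j < points.length; j++) {
            int l = assignment[j];
            for (int i = 0; i < dims; i++) {
                double diff = points[j][i] - centers[l][i];
                dispersion[l][i] += diff * diff;
            }
        }

        double[][] weights = new double[k][dims];
        for (int l = 0; l < k; l++) {
            // Shift by the smallest dispersion before exponentiating to avoid
            // underflow; the shift cancels when the row is normalized.
            double min = Arrays.stream(dispersion[l]).min().getAsDouble();
            double sum = 0.0;
            for (int i = 0; i < dims; i++) {
                weights[l][i] = Math.exp(-(dispersion[l][i] - min) / gamma);
                sum += weights[l][i];
            }
            for (int i = 0; i < dims; i++) {
                weights[l][i] /= sum;
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        // Toy data: cluster 0 is tight along dimension 0, cluster 1 along dimension 1.
        double[][] points = {{0, 0}, {0, 2}, {5, 5}, {7, 5}};
        int[] assignment = {0, 0, 1, 1};
        double[][] centers = {{0, 1}, {6, 5}};
        System.out.println(Arrays.deepToString(updateWeights(points, assignment, centers, 1.0)));
    }
}
```

In this sketch a dimension along which a cluster is compact receives a larger weight, and a larger gamma spreads the weights more evenly across dimensions. In a full clustering loop this step would follow the usual assignment and center-update steps of k-means.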