Entropy Weighting K-Means for high-dimensional data analysis
Main Author: | |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: | 2010 |
Subjects: | |
Online Access: | http://hdl.handle.net/10356/39388 |
Institution: | Nanyang Technological University |
Summary: | Entropy Weighting K-Means (EWKM) clustering is a new k-means-type algorithm for clustering high-dimensional objects in subspaces. For high-dimensional data, the clustering process calculates a weight for each dimension in each cluster and uses the weight values to identify the subsets of important dimensions that categorize different clusters. This is achieved by including the weight entropy in the objective function that is minimized in the k-means clustering process. An additional step is added to the k-means clustering process to automatically compute the weights of all dimensions in each cluster. Experiments on both synthetic and real data have shown that the new algorithm can generate better clustering results than other subspace clustering algorithms. In this project, the new algorithm is implemented in Java and is also scalable to large data sets [4]. However, L. Jing's paper computes Euclidean distance as the similarity measure between any two data points and tests only on low-dimensional data (2-D). In this project, the algorithm was first applied directly to the test data set; we then modified the original EWKM expressions into a revised version by introducing a cosine similarity measure, which gives better clustering accuracy. Entropy, purity and NMI scores are calculated and used as quantitative evaluation measures of the experimental results. We analyze the parameters, study the advantages of the revised algorithm further, and compare its effectiveness with simple K-Means and the original EWKM. |
---|
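
The additional weighting step described in the summary can be illustrated with a minimal sketch. It assumes the commonly cited form of the original EWKM update, in which the weight of dimension i in cluster l is proportional to exp(-D_li / gamma), where D_li is the squared Euclidean dispersion of the cluster's points around its center along dimension i. The function name `ewkm_weights`, the parameter name `gamma`, and the toy data are illustrative assumptions, not taken from the project. The sketch is in Python rather than the project's Java, and it does not cover the revised cosine-similarity version, whose expressions are not given in this record.

```python
import math


def ewkm_weights(points, labels, centers, gamma):
    """Return one weight vector per cluster; each vector sums to 1.

    Assumed update: weight[l][i] is proportional to exp(-D_li / gamma),
    where D_li is the squared dispersion of cluster l along dimension i.
    """
    dims = len(centers[0])
    weights = []
    for l, center in enumerate(centers):
        # Per-dimension squared dispersion of cluster l around its center.
        disp = [0.0] * dims
        for x, lab in zip(points, labels):
            if lab == l:
                for i in range(dims):
                    disp[i] += (x[i] - center[i]) ** 2
        # Normalized exponential of -disp / gamma. Subtracting the smallest
        # dispersion leaves the result unchanged and avoids underflow.
        d_min = min(disp)
        expo = [math.exp(-(d - d_min) / gamma) for d in disp]
        total = sum(expo)
        weights.append([e / total for e in expo])
    return weights


# Toy data (hypothetical): cluster 0 is tight along dimension 0,
# cluster 1 is tight along dimension 1.
points = [(0.0, 0.0), (0.2, 5.0), (-0.2, -5.0),
          (10.0, 3.0), (15.0, 3.1), (5.0, 2.9)]
labels = [0, 0, 0, 1, 1, 1]
centers = [(0.0, 0.0), (10.0, 3.0)]
w = ewkm_weights(points, labels, centers, gamma=1.0)
```

In this sketch a dimension along which a cluster is compact receives a large weight, so each cluster ends up with its own subset of important dimensions. A larger `gamma` spreads the weights more evenly across dimensions, while a smaller one concentrates them on the few dimensions with the least dispersion.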