An efficient clustering algorithm in the presence of outlier and doubtful data
Format: Thesis
Language: English
Published: 2015
Online Access: http://eprints.utm.my/id/eprint/79401/1/MuhamadAliasPFS2015.pdf
http://eprints.utm.my/id/eprint/79401/
Institution: Universiti Teknologi Malaysia
Summary: The presence of outlying observations is a common problem in most statistical analyses. This is also true when using cluster analysis techniques. Cluster analysis essentially detects homogeneous clusters with large heterogeneity among them. A correct procedure for handling outliers in cluster analysis is needed, because outliers often appear joined together, which can lead to a wrong cluster structure. A new trimming-based clustering method, RTCLUST, is proposed in this research; it combines information from trimming in clustering (TCLUST), partitioning around medoids (PAM), a doubtful-cluster method, and the local outlier factor (LOF). TCLUST is a clustering method with a constraint on the covariance matrices; in this case the constraint is placed on the eigenvalues. The spurious-outlier model explains how to use the eigenvalue ratio c for a good clustering method, and good clustering is assessed using the mean of the discriminant. The value c = 50 is found to be better than the value c = 1 used in a previous study. The trimmed likelihood is then used to determine the trimming proportion α and the number of clusters k. The next procedure combines TCLUST and PAM and is known as MPAM; PAM is used because the mean silhouette explains the clustering much better. The information obtained from MPAM consists of c = 50, α, and k. Different sample sizes are also used to test the suitability of MPAM. The mean of the discriminant and the mean silhouette are then used to measure the strength of the clustering, and the trimmed likelihood curve is used again to check the values of α and k. In the next step, the doubtful-cluster method with c = 50 reveals the overlapping outliers that exist between clusters. Data in the overlapping area are classified as doubtful outliers, and the best threshold is found to be 0.1. Lastly, LOF is used to differentiate between doubtful outliers and real outliers in the overlapping areas.
Since LOF can detect real outliers, deletion of these outliers is mandatory. The mean of the discriminant and the mean silhouette are obtained again after the real outliers are deleted, and a trimmed likelihood curve is then used to obtain the final values of α and k. This new RTCLUST procedure uses c = 50 and a threshold of 0.1 to obtain the mean of the discriminant and the mean silhouette. To justify RTCLUST, a Monte Carlo simulation with a medium sample size is carried out to check the validity of combining the methods, so that the normality of RTCLUST can be verified. The results show that the normality assumption for RTCLUST is fulfilled, so a Bayesian test can be used to decide the value of k significantly. RTCLUST achieves the lowest RMSE, showing that it is better than MPAM and TCLUST for both simulated and real data.
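The mean silhouette used above to judge PAM/MPAM clusterings can be computed directly from its definition, s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance from point i to its own cluster and b(i) is the smallest mean distance to any other cluster. The following is a minimal, self-contained sketch of that criterion (not the thesis code; the data and names are illustrative):

```python
# Sketch: mean silhouette width, the criterion used to compare clusterings.
def mean_silhouette(points, labels):
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)

    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q is not p]
        if not own:                 # singleton cluster: silhouette is 0
            scores.append(0.0)
            continue
        # a(i): mean distance to the point's own cluster
        a = sum(dist(p, q) for q in own) / len(own)
        # b(i): smallest mean distance to any other cluster
        b = min(
            sum(dist(p, q) for q in clusters[m]) / len(clusters[m])
            for m in clusters if m != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated clusters: the correct labelling scores near 1,
# a shuffled labelling scores much lower.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
print(mean_silhouette(pts, [0, 0, 0, 1, 1, 1]))
print(mean_silhouette(pts, [0, 1, 0, 1, 0, 1]))
```

A mean silhouette near 1 indicates compact, well-separated clusters, which is why it is a natural criterion for choosing among candidate values of k.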
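The local outlier factor used in the final step compares a point's local density with that of its k nearest neighbours; scores near 1 mean the point sits at the same density as its neighbours, while scores well above 1 flag real outliers. Below is a minimal pure-Python sketch of the standard LOF computation (Breunig et al.), not the thesis implementation; the data set and parameter k = 2 are illustrative assumptions:

```python
# Sketch: local outlier factor (LOF) for small data sets.
def lof_scores(points, k=2):
    def dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5

    n = len(points)
    # k nearest neighbours (indices) and the k-distance of each point
    neigh, kdist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist(i, j))
        neigh.append(order[:k])
        kdist.append(dist(i, order[k - 1]))

    # reachability distance of i from j: max of j's k-distance and d(i, j)
    def reach(i, j):
        return max(kdist[j], dist(i, j))

    # local reachability density (lrd)
    lrd = [k / sum(reach(i, j) for j in neigh[i]) for i in range(n)]

    # LOF: mean ratio of the neighbours' densities to the point's own density
    return [sum(lrd[j] for j in neigh[i]) / (k * lrd[i]) for i in range(n)]

# A tight cluster plus one far-away point: the last point gets LOF >> 1,
# the cluster points stay near 1.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (8.0, 8.0)]
scores = lof_scores(pts, k=2)
print(scores)
```

In the procedure described above, points in an overlapping region whose LOF is clearly above 1 would be treated as real outliers and deleted, while the rest remain doubtful.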