Efficient clustering algorithm for large datasets

Clustering, in data mining, is useful for identifying interesting distributions and discovering groups in the underlying data. Traditional clustering algorithms either favor clusters with similar sizes and spherical shapes, or are very sensitive to outliers. These shortcomings are alleviated in a no...

Bibliographic Details
Main Author:	Chen, Fangying.
Other Authors:	Chen Lihui
Format:	Final Year Project
Language:	English
Published:	2010
Subjects:	DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems
Online Access:	http://hdl.handle.net/10356/40791
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-40791
record_format	dspace
spelling	sg-ntu-dr.10356-40791 2023-07-07T17:09:15Z Efficient clustering algorithm for large datasets Chen, Fangying. Chen Lihui School of Electrical and Electronic Engineering DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems Bachelor of Engineering 2010-06-22T02:10:52Z 2010-06-22T02:10:52Z 2010 2010 Final Year Project (FYP) http://hdl.handle.net/10356/40791 en Nanyang Technological University 69 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems
spellingShingle	DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems Chen, Fangying. Efficient clustering algorithm for large datasets
description	Clustering, in data mining, is useful for identifying interesting distributions and discovering groups in the underlying data. Traditional clustering algorithms either favor clusters of similar size and spherical shape or are very sensitive to outliers. CURE (Clustering Using REpresentatives), an algorithm proposed by Guha, Rastogi and Shim, alleviates these shortcomings by representing each cluster with a constant number of well-scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. To keep up with the rapid growth in database sizes, CURE incorporates two techniques, random sampling and partitioning, to cope with large datasets. Both techniques reduce the input to the clustering process so that it fits in main memory. High-dimensional data is now common in a wide range of real-life applications, such as web documents, transaction data and gene expression data, so there is an urgent need for efficient clustering of high-dimensional data. In this Final Year Project, the CURE algorithm is first implemented in Java for low-dimensional data. The program is tested on sample datasets, a series of simulations with different parameter settings is carried out, and a parameter sensitivity analysis is performed. After being verified on low-dimensional data, the program is modified to handle high-dimensional data; the modified program is then tested on high-dimensional sample datasets, and a parameter analysis is performed as well. The objective of this project is to implement CURE in Java; the implementation details, testing results and performance evaluation are reported.
author2	Chen Lihui
author_facet	Chen Lihui Chen, Fangying.
format	Final Year Project
author	Chen, Fangying.
author_sort	Chen, Fangying.
title	Efficient clustering algorithm for large datasets
title_short	Efficient clustering algorithm for large datasets
title_full	Efficient clustering algorithm for large datasets
title_fullStr	Efficient clustering algorithm for large datasets
title_full_unstemmed	Efficient clustering algorithm for large datasets
title_sort	efficient clustering algorithm for large datasets
publishDate	2010
url	http://hdl.handle.net/10356/40791
_version_	1772828140931907584
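
The description above says that CURE represents each cluster with a constant number of well-scattered points and then shrinks them toward the cluster center by a specified fraction. The Java sketch below illustrates that one step only. It is not the project's code: the class and method names are hypothetical, Euclidean distance is assumed, and the scattered points are picked with a greedy farthest-point heuristic (first the point farthest from the centroid, then repeatedly the point farthest from those already chosen).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch (hypothetical names): CURE-style representatives for one cluster. */
final class CureRepresentatives {

    /** Component-wise mean of the points, used as the cluster center. */
    static double[] centroid(List<double[]> points) {
        int dim = points.get(0).length;
        double[] mean = new double[dim];
        for (double[] p : points) {
            for (int d = 0; d < dim; d++) {
                mean[d] += p[d];
            }
        }
        for (int d = 0; d < dim; d++) {
            mean[d] /= points.size();
        }
        return mean;
    }

    /** Euclidean distance (an assumption; other metrics could be swapped in). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Greedily picks up to c well-scattered points from a non-empty cluster. */
    static List<double[]> scatteredPoints(List<double[]> cluster, int c) {
        double[] mean = centroid(cluster);
        List<double[]> chosen = new ArrayList<>();
        int limit = Math.min(c, cluster.size());
        while (chosen.size() < limit) {
            double[] best = null;
            double bestScore = -1.0;
            for (double[] p : cluster) {
                if (chosen.contains(p)) {
                    continue; // arrays compare by reference, so this skips picked points
                }
                double score;
                if (chosen.isEmpty()) {
                    score = distance(p, mean); // first pick: farthest from the center
                } else {
                    score = Double.MAX_VALUE;  // later picks: distance to nearest chosen point
                    for (double[] q : chosen) {
                        score = Math.min(score, distance(p, q));
                    }
                }
                if (score > bestScore) {
                    bestScore = score;
                    best = p;
                }
            }
            chosen.add(best);
        }
        return chosen;
    }

    /** Moves each scattered point a fraction alpha (0..1) of the way toward the center. */
    static List<double[]> representatives(List<double[]> cluster, int c, double alpha) {
        double[] mean = centroid(cluster);
        List<double[]> reps = new ArrayList<>();
        for (double[] p : scatteredPoints(cluster, c)) {
            double[] r = new double[p.length];
            for (int d = 0; d < p.length; d++) {
                r[d] = p[d] + alpha * (mean[d] - p[d]);
            }
            reps.add(r);
        }
        return reps;
    }
}
```

For example, calling CureRepresentatives.representatives(cluster, 2, 0.5) on the points (0,0), (1,0), (2,0), (3,0) and (10,0) picks the two extreme points and moves each halfway toward the centroid (3.2, 0). A larger alpha pulls the representatives closer to the center, which is how shrinking dampens the influence of outliers.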

Similar Items