Semi-supervised clustering algorithms for web documents

Clustering is one of the most popular data mining techniques for finding user-desired patterns accurately and efficiently from large volumes of data. However, due to the curse of dimensionality, clustering high-dimensional data like web documents and biological data can be a challengin...

Bibliographic Details
Main Author:	Hua, Yunke.
Other Authors:	Chen Lihui
Format:	Final Year Project
Language:	English
Published:	2013
Subjects:	DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems
Online Access:	http://hdl.handle.net/10356/53348
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-53348
record_format	dspace
spelling	sg-ntu-dr.10356-53348 2023-07-07T15:51:05Z Semi-supervised clustering algorithms for web documents Hua, Yunke. Chen Lihui School of Electrical and Electronic Engineering DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems Bachelor of Engineering 2013-05-31T07:43:26Z 2013-05-31T07:43:26Z 2013 2013 Final Year Project (FYP) http://hdl.handle.net/10356/53348 en Nanyang Technological University 49 p. application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems
spellingShingle	DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems Hua, Yunke. Semi-supervised clustering algorithms for web documents
description	Clustering is one of the most popular data mining techniques for finding user-desired patterns accurately and efficiently from large volumes of data. However, due to the curse of dimensionality, clustering high-dimensional data such as web documents and biological data can be challenging, because cluster patterns are difficult to find in a high-dimensional space. In this project, a new semi-supervised fuzzy co-clustering algorithm called SSFCR is proposed, based on the original fuzzy co-clustering with Ruspini’s condition (FCR) algorithm. Fuzzy clustering is used because real-world data tend to overlap. Co-clustering is adopted because it simultaneously clusters the features, dynamically reducing the dimensionality of the object clustering space, which suits high-dimensional data such as web documents. For the semi-supervised method, prior knowledge in the form of two sets of pairwise constraints is introduced into the clustering process to improve accuracy and efficiency. Each constraint specifies whether a pair of documents “must-link” (must be in the same cluster) or “cannot-link” (must be in different clusters). The category labels behind the pairwise constraints can be taken from either ground-truth label information or user-assigned categorical values. The whole clustering process is treated as maximizing an aggregation cost function that includes the semi-supervised terms. By applying the Lagrange multiplier method, the membership update rules for the new SSFCR algorithm are derived. Next, an extensive experimental study is carried out on several large benchmark datasets using various parameter settings to show the improvement in accuracy, stability and efficiency of the new SSFCR algorithm.
author2	Chen Lihui
author_facet	Chen Lihui Hua, Yunke.
format	Final Year Project
author	Hua, Yunke.
author_sort	Hua, Yunke.
title	Semi-supervised clustering algorithms for web documents
title_short	Semi-supervised clustering algorithms for web documents
title_full	Semi-supervised clustering algorithms for web documents
title_fullStr	Semi-supervised clustering algorithms for web documents
title_full_unstemmed	Semi-supervised clustering algorithms for web documents
title_sort	semi-supervised clustering algorithms for web documents
publishDate	2013
url	http://hdl.handle.net/10356/53348
_version_	1772828933663752192
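
The abstract notes that must-link and cannot-link constraints can be taken from ground-truth labels. The following is a minimal Python sketch of that one step only, not of SSFCR itself and not the author's code. The function name `sample_pairwise_constraints`, its parameters, and the toy labels are hypothetical and chosen purely for illustration.

```python
from itertools import combinations
import random


def sample_pairwise_constraints(labels, n_pairs, seed=0):
    """Sample must-link and cannot-link document pairs from known labels.

    labels: list where labels[i] is the category of document i.
    Returns (must_link, cannot_link), two lists of (i, j) pairs with i < j.
    This helper is an illustrative assumption, not part of the FYP report.
    """
    rng = random.Random(seed)
    # Every unordered pair of document indices.
    all_pairs = list(combinations(range(len(labels)), 2))
    # Draw a random subset of pairs to act as the prior knowledge.
    chosen = rng.sample(all_pairs, min(n_pairs, len(all_pairs)))
    # Same label -> must-link; different label -> cannot-link.
    must_link = [(i, j) for i, j in chosen if labels[i] == labels[j]]
    cannot_link = [(i, j) for i, j in chosen if labels[i] != labels[j]]
    return must_link, cannot_link


# Toy example: five documents with made-up category labels.
labels = ["sports", "sports", "politics", "tech", "politics"]
must_link, cannot_link = sample_pairwise_constraints(labels, n_pairs=10)
print("must-link:", sorted(must_link))
print("cannot-link:", sorted(cannot_link))
```

In a semi-supervised setting, the two lists would typically be passed to the clustering algorithm as side information; how SSFCR weights them in its cost function is described in the full report, not here.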

Similar Items