Semi-supervised clustering algorithms for web documents

Clustering is one of the most popular data mining techniques for finding user-desired patterns accurately and efficiently from a huge flow of data. However, due to the curse of dimensionality, clustering high-dimensional data such as web documents and biological data can be a challengin...

Bibliographic Details
Main Author: Hua, Yunke.
Other Authors: Chen Lihui
Format: Final Year Project
Language: English
Published: 2013
Subjects:
Online Access: http://hdl.handle.net/10356/53348
Institution: Nanyang Technological University
id sg-ntu-dr.10356-53348
record_format dspace
spelling sg-ntu-dr.10356-53348 2023-07-07T15:51:05Z Semi-supervised clustering algorithms for web documents Hua, Yunke. Chen Lihui School of Electrical and Electronic Engineering DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems Clustering is one of the most popular data mining techniques for finding user-desired patterns accurately and efficiently from a huge flow of data. However, due to the curse of dimensionality, clustering high-dimensional data such as web documents and biological data can be a challenging task, as cluster patterns are difficult to find in a high-dimensional space. In this project, a new semi-supervised fuzzy co-clustering algorithm called SSFCR is proposed, based on the original fuzzy co-clustering with Ruspini’s condition (FCR) algorithm. Fuzzy clustering is used because real-world data tend to overlap. Co-clustering is adopted since it simultaneously clusters the features, dynamically reducing the dimensionality of the object clustering space, which makes it suitable for clustering high-dimensional data such as web documents. For the semi-supervised method, prior knowledge in the form of two sets of pair-wise constraints is introduced into the clustering process to improve accuracy and efficiency. Each constraint specifies whether a pair of documents “must-link” (must be in the same cluster) or “cannot-link” (must be in different clusters). The categorical labels behind the pair-wise constraints can be taken from either ground-truth label information or user-assigned categorical values. The whole clustering process is treated as a maximization problem over an aggregation cost function with semi-supervised terms. By applying the Lagrange multiplier method, the membership update rules for the new SSFCR algorithm are derived. Next, an extensive experimental study is carried out on several large benchmark datasets using various parameter settings to show the improvement in accuracy, stability and efficiency of the new SSFCR algorithm. Bachelor of Engineering 2013-05-31T07:43:26Z 2013-05-31T07:43:26Z 2013 2013 Final Year Project (FYP) http://hdl.handle.net/10356/53348 en Nanyang Technological University 49 p. application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems
spellingShingle DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems
Hua, Yunke.
Semi-supervised clustering algorithms for web documents
description Clustering is one of the most popular data mining techniques for finding user-desired patterns accurately and efficiently from a huge flow of data. However, due to the curse of dimensionality, clustering high-dimensional data such as web documents and biological data can be a challenging task, as cluster patterns are difficult to find in a high-dimensional space. In this project, a new semi-supervised fuzzy co-clustering algorithm called SSFCR is proposed, based on the original fuzzy co-clustering with Ruspini’s condition (FCR) algorithm. Fuzzy clustering is used because real-world data tend to overlap. Co-clustering is adopted since it simultaneously clusters the features, dynamically reducing the dimensionality of the object clustering space, which makes it suitable for clustering high-dimensional data such as web documents. For the semi-supervised method, prior knowledge in the form of two sets of pair-wise constraints is introduced into the clustering process to improve accuracy and efficiency. Each constraint specifies whether a pair of documents “must-link” (must be in the same cluster) or “cannot-link” (must be in different clusters). The categorical labels behind the pair-wise constraints can be taken from either ground-truth label information or user-assigned categorical values. The whole clustering process is treated as a maximization problem over an aggregation cost function with semi-supervised terms. By applying the Lagrange multiplier method, the membership update rules for the new SSFCR algorithm are derived. Next, an extensive experimental study is carried out on several large benchmark datasets using various parameter settings to show the improvement in accuracy, stability and efficiency of the new SSFCR algorithm.
author2 Chen Lihui
author_facet Chen Lihui
Hua, Yunke.
format Final Year Project
author Hua, Yunke.
author_sort Hua, Yunke.
title Semi-supervised clustering algorithms for web documents
title_short Semi-supervised clustering algorithms for web documents
title_full Semi-supervised clustering algorithms for web documents
title_fullStr Semi-supervised clustering algorithms for web documents
title_full_unstemmed Semi-supervised clustering algorithms for web documents
title_sort semi-supervised clustering algorithms for web documents
publishDate 2013
url http://hdl.handle.net/10356/53348
_version_ 1772828933663752192
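
The abstract describes prior knowledge supplied as two sets of pair-wise constraints, "must-link" and "cannot-link", whose labels may come from ground-truth or user-assigned categories. As a minimal illustration only, the Python sketch below shows one way such constraint sets could be built from a handful of labeled documents. The function name `build_pairwise_constraints`, the dictionary input format, and the sample labels are assumptions made for this example and are not taken from the project report; the sketch does not implement SSFCR itself.

```python
from itertools import combinations


def build_pairwise_constraints(labeled_docs):
    """Split every pair of labeled documents into must-link / cannot-link sets.

    labeled_docs: dict mapping a document index to its category label
    (a hypothetical input format, assumed for this sketch).
    Returns two sets of (i, j) index pairs with i < j.
    """
    must_link, cannot_link = set(), set()
    # Sorting by document index keeps each pair ordered as (smaller, larger).
    for (i, label_i), (j, label_j) in combinations(sorted(labeled_docs.items()), 2):
        if label_i == label_j:
            must_link.add((i, j))      # same category: must share a cluster
        else:
            cannot_link.add((i, j))    # different categories: must be separated
    return must_link, cannot_link


# Illustrative labels only; not data from the project.
labels = {0: "sports", 3: "sports", 7: "politics"}
must_link, cannot_link = build_pairwise_constraints(labels)
print("must-link:", sorted(must_link))
print("cannot-link:", sorted(cannot_link))
```

In a semi-supervised clustering setup, sets like these would typically feed penalty or reward terms in the objective function; how SSFCR weights them is specified in the report, not here.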