Semi-supervised clustering algorithms for web documents
Clustering is one of the most popular data mining techniques for finding user-desired patterns accurately and efficiently in large volumes of data. However, due to the curse of dimensionality, clustering high-dimensional data like web documents and biological data can be a challengin...
Main Author: | Hua, Yunke |
---|---|
Other Authors: | Chen Lihui |
Format: | Final Year Project |
Language: | English |
Published: | 2013 |
Subjects: | DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems |
Online Access: | http://hdl.handle.net/10356/53348 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-53348 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-53348 2023-07-07T15:51:05Z Semi-supervised clustering algorithms for web documents Hua, Yunke. Chen Lihui School of Electrical and Electronic Engineering DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems Bachelor of Engineering 2013-05-31T07:43:26Z 2013-05-31T07:43:26Z 2013 2013 Final Year Project (FYP) http://hdl.handle.net/10356/53348 en Nanyang Technological University 49 p. application/pdf |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems |
spellingShingle |
DRNTU::Engineering::Electrical and electronic engineering::Computer hardware, software and systems Hua, Yunke. Semi-supervised clustering algorithms for web documents |
description |
Clustering is one of the most popular data mining techniques for finding user-desired patterns accurately and efficiently in large volumes of data. However, due to the curse of dimensionality, clustering high-dimensional data such as web documents and biological data can be challenging, because cluster patterns are difficult to find in a high-dimensional space.
In this project, a new semi-supervised fuzzy co-clustering algorithm called SSFCR is proposed, based on the original fuzzy co-clustering with Ruspini’s condition (FCR) algorithm. Fuzzy clustering is used because real-world data tend to overlap. Co-clustering is adopted because it simultaneously clusters the features, dynamically reducing the dimensionality of the object clustering space, which makes it suitable for high-dimensional data such as web documents. For the semi-supervised part, prior knowledge in the form of two sets of pairwise constraints is introduced into the clustering process to improve accuracy and efficiency. Each constraint specifies whether a pair of documents “must-link” (must be in the same cluster) or “cannot-link” (must be in different clusters). The constraints can be derived from either ground-truth label information or user-assigned categorical values. The whole clustering process is treated as maximizing an aggregation cost function that includes the semi-supervised terms. Applying the Lagrange multiplier method yields the membership update rules for SSFCR. Extensive experiments are then carried out on several large benchmark datasets under various parameter settings to show the improvement in accuracy, stability and efficiency of the new SSFCR algorithm. |
author2 |
Chen Lihui |
author_facet |
Chen Lihui Hua, Yunke. |
format |
Final Year Project |
author |
Hua, Yunke. |
author_sort |
Hua, Yunke. |
title |
Semi-supervised clustering algorithms for web documents |
title_short |
Semi-supervised clustering algorithms for web documents |
title_full |
Semi-supervised clustering algorithms for web documents |
title_fullStr |
Semi-supervised clustering algorithms for web documents |
title_full_unstemmed |
Semi-supervised clustering algorithms for web documents |
title_sort |
semi-supervised clustering algorithms for web documents |
publishDate |
2013 |
url |
http://hdl.handle.net/10356/53348 |
_version_ |
1772828933663752192 |
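
The pairwise constraints described in the abstract can be illustrated with a small sketch. This is not the SSFCR cost function, which the record does not give. It only assumes that constraints are derived from known labels, that each document's fuzzy memberships sum to 1 across clusters (taken here as the form of Ruspini's condition), and that agreement is measured by the dot product of two membership rows. The names `build_constraints`, `constraint_agreement`, `labels` and `U` are hypothetical, as are the data values.

```python
from itertools import combinations


def build_constraints(labels):
    """Split every pair of labelled documents into must-link / cannot-link sets."""
    must_link, cannot_link = [], []
    for i, j in combinations(sorted(labels), 2):
        if labels[i] == labels[j]:
            must_link.append((i, j))      # same label: must share a cluster
        else:
            cannot_link.append((i, j))    # different labels: must be separated
    return must_link, cannot_link


def shared_membership(U, i, j):
    """Degree to which documents i and j fall in the same clusters."""
    return sum(a * b for a, b in zip(U[i], U[j]))


def constraint_agreement(U, must_link, cannot_link):
    """Illustrative score: reward must-link overlap, penalise cannot-link overlap."""
    reward = sum(shared_membership(U, i, j) for i, j in must_link)
    penalty = sum(shared_membership(U, i, j) for i, j in cannot_link)
    return reward - penalty


# Hypothetical labels for four of five documents; document 4 is unlabelled.
labels = {0: "sports", 1: "sports", 2: "finance", 3: "finance"}
must_link, cannot_link = build_constraints(labels)

# Hypothetical fuzzy memberships: one row per document, one column per cluster.
U = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9], [0.5, 0.5]]

# Assumed form of Ruspini's condition: each document's memberships sum to 1.
assert all(abs(sum(row) - 1.0) < 1e-9 for row in U)

score = constraint_agreement(U, must_link, cannot_link)
print(must_link, cannot_link, round(score, 2))
```

A higher score means the memberships respect more of the supplied constraints. A semi-supervised objective of the kind the abstract describes would combine terms like these with the unsupervised clustering criterion.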