Semi-supervised clustering algorithms for web documents

Bibliographic Details
Main Author: Hua, Yunke.
Other Authors: Chen Lihui
Format: Final Year Project
Language: English
Published: 2013
Online Access: http://hdl.handle.net/10356/53348
Institution: Nanyang Technological University
Description
Summary: Clustering is one of the most popular data mining techniques for finding user-desired patterns accurately and efficiently in large volumes of data. However, due to the curse of dimensionality, clustering high-dimensional data such as web documents and biological data can be challenging, because cluster patterns are difficult to find in a high-dimensional space. In this project, a new semi-supervised fuzzy co-clustering algorithm called SSFCR is proposed, based on the original fuzzy co-clustering with Ruspini’s condition (FCR) algorithm. Fuzzy clustering is used because real-world data naturally overlap. Co-clustering is adopted because it simultaneously clusters the features, which dynamically reduces the dimensionality of the object clustering space and therefore suits high-dimensional data such as web documents. For the semi-supervised component, prior knowledge in the form of two sets of pairwise constraints is introduced into the clustering process to improve accuracy and efficiency. Each constraint specifies whether a pair of documents “must-link” (must be in the same cluster) or “cannot-link” (must be in different clusters). The category labels behind the pairwise constraints can be taken from either ground-truth label information or user-assigned categorical values. The whole clustering process is treated as maximizing an aggregation cost function that includes the semi-supervised terms. By applying the Lagrange multiplier method, the membership update rules for SSFCR are derived. Finally, an extensive experimental study is carried out on several large benchmark datasets under various parameter settings to show the improvements in accuracy, stability and efficiency of the new SSFCR algorithm.
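
As an illustration of the pairwise constraints described in the summary, the following minimal Python sketch derives must-link and cannot-link sets from a small labelled subset of documents and counts how many constraints a given cluster assignment violates. It is not the SSFCR algorithm itself: SSFCR works with fuzzy memberships and a cost function whose form is not given in this record, whereas this sketch assumes hard cluster assignments. The function names `build_constraints` and `count_violations`, the document ids, and the labels are all hypothetical.

```python
from itertools import combinations


def build_constraints(labels):
    """Derive pairwise constraints from a labelled subset of documents.

    labels: dict mapping document id -> category label (hypothetical data).
    Returns (must_link, cannot_link), each a set of (id_a, id_b) pairs.
    """
    must_link, cannot_link = set(), set()
    for a, b in combinations(sorted(labels), 2):
        if labels[a] == labels[b]:
            must_link.add((a, b))      # same label: must share a cluster
        else:
            cannot_link.add((a, b))    # different labels: must be separated
    return must_link, cannot_link


def count_violations(assignment, must_link, cannot_link):
    """Count constraints broken by a hard assignment (doc id -> cluster id)."""
    broken_ml = sum(1 for a, b in must_link if assignment[a] != assignment[b])
    broken_cl = sum(1 for a, b in cannot_link if assignment[a] == assignment[b])
    return broken_ml + broken_cl


# Hypothetical labelled subset: four documents, two categories.
labels = {0: "sports", 1: "sports", 2: "finance", 3: "finance"}
must_link, cannot_link = build_constraints(labels)

# Hypothetical clustering result that misplaces document 3.
assignment = {0: 0, 1: 0, 2: 1, 3: 0}
violations = count_violations(assignment, must_link, cannot_link)
print(must_link, cannot_link, violations)
```

In a semi-supervised setting, a violation count of this kind is one simple way to check how well a clustering agrees with the supplied prior knowledge; the project instead folds the constraints into the objective as semi-supervised terms.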