A proximity-based fuzzy clustering for web mining
Main Author: | |
---|---|
Other Authors: | |
Format: | Theses and Dissertations |
Language: | English |
Published: | 2014 |
Subjects: | |
Online Access: | http://hdl.handle.net/10356/55253 |
Institution: | Nanyang Technological University |
Summary: | Fuzzy C-means clustering (FCM) is one of the fundamental clustering techniques and has been widely used over the past 30 years for clustering objects in image processing. However, FCM has shortcomings; for example, it does not take users' habits or preferences into consideration. Meanwhile, in web search services, there remains a substantial gap between what users expect and what they actually get.
Thus, we employ a method called P-FCM, a proximity-based fuzzy C-means proposed by W. Pedrycz et al. in 2003 [4]. As the name suggests, supervision is realized through a number of proximity hints or constraints provided by the users, which specify the extent to which pairs of patterns are regarded as similar or different. These hints act as prior knowledge for the clustering process and split the optimization into two phases. The first phase is standard fuzzy C-means; the second is a gradient-driven optimization that reduces the differences between the proximity constraints and the proximities computed from the partition matrix obtained in the first phase. We then put forward an improved method, the modified P-HFCM, which uses cosine distance instead of Euclidean distance to represent the relationship between documents.
We reproduce two small-dataset examples from W. Pedrycz's paper, once in Java and once in Matlab. In addition, we observe the performance of P-HFCM (Euclidean distance) and modified P-HFCM (cosine distance) on several high-dimensional datasets under different parameter settings. We set up a series of evaluation methods that compare the clustering results with the predefined ground truth from several perspectives, and we analyze how adjusting the parameters affects the clustering results. |
---|
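The two-phase idea in the summary can be illustrated with a minimal Python sketch. It is not the thesis's Java or Matlab code, which is not shown in this record. The function names (`fcm`, `induced_proximity`, `proximity_mismatch`, `cosine_distance`) are hypothetical, and the sum-of-minima formula for the proximity induced by the partition matrix is an assumption taken from the usual P-FCM formulation. The sketch covers standard fuzzy C-means and the quantity the second phase would try to reduce; the gradient update itself is omitted, and the mismatch is left unweighted for simplicity.

```python
import numpy as np


def fcm(X, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Standard fuzzy C-means with Euclidean distance.

    X: (n, d) data, c: number of clusters, m: fuzzifier (> 1).
    Returns the partition matrix U of shape (c, n) and prototypes V of shape (c, d).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)  # each column (pattern) sums to 1
    V = None
    for _ in range(iters):
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)  # weighted prototypes
        # D[i, k] = distance from prototype i to pattern k
        D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
        D = np.fmax(D, 1e-12)  # avoid division by zero
        inv = D ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, V


def induced_proximity(U):
    """Proximity implied by U (assumed form): p_hat[k1, k2] = sum_i min(u[i, k1], u[i, k2])."""
    return np.minimum(U[:, :, None], U[:, None, :]).sum(axis=0)


def proximity_mismatch(U, hints):
    """Unweighted squared mismatch between user hints (k1, k2, p) and induced proximities.

    A second, gradient-driven phase would adjust U to reduce this kind of quantity.
    """
    P = induced_proximity(U)
    return sum((P[k1, k2] - p) ** 2 for k1, k2, p in hints)


def cosine_distance(a, b):
    """1 - cosine similarity; insensitive to vector length, unlike Euclidean distance."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Under these assumptions, a hint such as `(0, 3, 1.0)` states that patterns 0 and 3 should be treated as fully similar; `proximity_mismatch` then measures how far the first-phase partition is from honoring it. `cosine_distance` shows why the choice of metric matters for documents: two term vectors that differ only in length have zero cosine distance but a nonzero Euclidean distance.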