Clustering techniques for web documents

Document clustering is the process of grouping documents into natural, homogeneous clusters so that documents within the same cluster are more similar to each other than to those in other clusters [1]. In the web environment, the task is more challenging. Essential clustering te...

Bibliographic Details
Main Author: Pan, Tianchi
Other Authors: Chen Lihui
Format: Final Year Project
Language: English
Published: 2013
Subjects:
Online Access: http://hdl.handle.net/10356/54272
Institution: Nanyang Technological University
id sg-ntu-dr.10356-54272
record_format dspace
spelling sg-ntu-dr.10356-54272 2023-07-07T16:08:39Z Clustering techniques for web documents Pan, Tianchi Chen Lihui School of Electrical and Electronic Engineering DRNTU::Engineering Bachelor of Engineering 2013-06-18T04:22:13Z 2013-06-18T04:22:13Z 2013 2013 Final Year Project (FYP) http://hdl.handle.net/10356/54272 en Nanyang Technological University 53 p. application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering
spellingShingle DRNTU::Engineering
Pan, Tianchi
Clustering techniques for web documents
description Document clustering is the process of grouping documents into natural, homogeneous clusters so that documents within the same cluster are more similar to each other than to those in other clusters [1]. In the web environment, the task is more challenging, and effective clustering techniques need to be employed to facilitate knowledge discovery. K-means is one of the most frequently used methods in data clustering; however, it fails to find a meaningful clustering when the input data is less structured. Therefore, in this report the distance metric learning method proposed by Eric P. Xing is implemented, using supplementary side information to improve K-means clustering performance. The new algorithm is studied in detail and validated on different datasets. Its performance is evaluated with quantitative measures (NMI, purity, and Rand index) computed in Java, and with cluster visualization in MATLAB. The results show that the new clustering algorithm gives an encouraging improvement over the original K-means and might be used in future data clustering applications.
author2 Chen Lihui
author_facet Chen Lihui
Pan, Tianchi
format Final Year Project
author Pan, Tianchi
author_sort Pan, Tianchi
title Clustering techniques for web documents
title_sort clustering techniques for web documents
publishDate 2013
url http://hdl.handle.net/10356/54272
_version_ 1772827138688286720
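
The description above names three external evaluation measures (purity, Rand index, and NMI) without defining them. The following is a minimal sketch of how they are commonly computed from ground-truth class labels and predicted cluster labels. It is not the report's Java code. The function names are hypothetical, and the NMI normalization (mutual information divided by the geometric mean of the two entropies) is an assumption, since the record does not say which variant the report used.

```python
from collections import Counter
from math import comb, log, sqrt


def purity(truth, pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    by_cluster = {}
    for t, p in zip(truth, pred):
        by_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(truth)


def rand_index(truth, pred):
    """Fraction of document pairs on which the two labelings agree."""
    pairs = comb(len(truth), 2)
    same_both = sum(comb(v, 2) for v in Counter(zip(truth, pred)).values())
    same_truth = sum(comb(v, 2) for v in Counter(truth).values())
    same_pred = sum(comb(v, 2) for v in Counter(pred).values())
    # Agreements = pairs together in both labelings + pairs apart in both.
    return (pairs + 2 * same_both - same_truth - same_pred) / pairs


def entropy(labels):
    """Shannon entropy (natural log) of a label distribution."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def nmi(truth, pred):
    """Mutual information normalized by sqrt(H(truth) * H(pred)) (assumed variant)."""
    n = len(truth)
    n_t, n_p = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))
    mi = sum(c / n * log(c * n / (n_t[t] * n_p[p])) for (t, p), c in joint.items())
    h_t, h_p = entropy(truth), entropy(pred)
    if h_t == 0 or h_p == 0:
        # Degenerate case: a single class or cluster. Convention chosen here:
        # 1.0 if both labelings are trivial, otherwise 0.0.
        return 1.0 if h_t == h_p else 0.0
    return mi / sqrt(h_t * h_p)


if __name__ == "__main__":
    truth = ["a", "a", "a", "b", "b", "b"]  # illustrative labels, not report data
    pred = [0, 0, 1, 1, 1, 1]
    print(purity(truth, pred))      # 5/6
    print(rand_index(truth, pred))  # 10/15
    print(nmi(truth, pred))
```

All three measures are invariant to how the clusters are numbered, so a clustering that matches the ground truth under a relabeling still scores 1.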