Document clustering based on inverse document frequency measure

Automatic classification techniques are capable of providing the necessary information organization by arranging the retrieved data into groups of documents with common subjects. Recently, document clustering has been put forth as an alternative method of organizing the results of retrieval. It has been...

Bibliographic Details
Main Author: Wan Faridah Hanum, Wan Yaacob
Format: Thesis
Language: English
Published: 2005
Subjects: HF5001-6182 Business
Online Access: http://etd.uum.edu.my/1367/1/WAN_FARIDAH_HANUM_BT._WAN_YAACOB.pdf
http://etd.uum.edu.my/1367/2/1.WAN_FARIDAH_HANUM_BT._WAN_YAACOB.pdf
http://etd.uum.edu.my/1367/
http://sierra.uum.edu.my/record=b1170635~S1
Institution: Universiti Utara Malaysia
id my.uum.etd.1367
record_format eprints
spelling my.uum.etd.1367 2019-11-12T02:13:09Z 2005-04-07 Thesis NonPeerReviewed application/pdf en Wan Faridah Hanum, Wan Yaacob (2005) Document clustering based on inverse document frequency measure. Masters thesis, Universiti Utara Malaysia.
institution Universiti Utara Malaysia
building UUM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Utara Malaysia
content_source UUM Electronic Theses
url_provider http://etd.uum.edu.my/
language English
topic HF5001-6182 Business
description Automatic classification techniques can provide the necessary information organization by arranging retrieved data into groups of documents with common subjects. Recently, document clustering has been put forth as an alternative method of organizing retrieval results. It has been proposed for navigating and browsing document collections and for discovering hidden similarities and key concepts. It also summarizes a large number of documents using the key or common attributes of each cluster, and can be used to categorize document databases. This paper describes several narrative clustering techniques, such as the Porter algorithm, the Gusfield algorithm, similarity based on document hierarchy, and Inverse Document Frequency (IDF), which intersect the documents in a cluster to determine the set of words (or phrases) shared by all the documents in the cluster. This study proposes document clustering based on IDF, which assumes that the importance of a keyword in calculating similarity measures is inversely proportional to the total number of documents that contain it. IDF is easy to understand and has a geometric interpretation; its term weighting has been shown to help clustering, and it allows partial matching and returns ranked documents. In this study, 30 document cases were tested with the IDF algorithm, and the results were divided into three categories: correct cluster, incorrect cluster, and unknown cluster.
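The IDF idea in the description can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the record does not give the exact weighting formula, so the common form idf(t) = log(N / df(t)) is assumed here, and the function names (`idf_weights`, `tfidf_vector`, `cosine`) and the three sample documents are hypothetical. Cosine similarity over the weighted vectors stands in for the "geometric interpretation", "partial matching" and "ranked documents" mentioned above.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Map each term to log(N / df), where df is the number of documents containing it.
    Assumed formula; the thesis record does not state which IDF variant was used."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}

def tfidf_vector(doc, idf):
    """Weight each term's raw count in one document by its IDF."""
    tf = Counter(doc)
    return {term: tf[term] * idf.get(term, 0.0) for term in tf}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy collection, tokenized by whitespace.
docs = [
    "document clustering groups similar documents".split(),
    "clustering documents by shared keywords".split(),
    "stock market prices fell sharply".split(),
]

idf = idf_weights(docs)
vectors = [tfidf_vector(doc, idf) for doc in docs]

# Rank the other documents by similarity to the first one.
ranked = sorted(
    range(1, len(docs)),
    key=lambda i: cosine(vectors[0], vectors[i]),
    reverse=True,
)
```

In this sketch a term that appears in only one document ("stock") gets a larger weight than one that appears in two ("clustering"), and the second document still matches the first on the shared terms even though most of their words differ.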
format Thesis
author Wan Faridah Hanum, Wan Yaacob
title Document clustering based on inverse document frequency measure
publishDate 2005