A STUDY OF N-GRAM REPRESENTATION IN HMRFKMEANS ALGORITHM FOR DOCUMENT CLUSTERING

The HMRF-KMeans algorithm satisfies several requirements of document clustering: it handles high-dimensional data, scales well, is accurate, and does not depend on prior domain knowledge. It does not, however, satisfy the requirements for meaningful cluster descriptions and for a data representation that preserves...

Bibliographic Details
Main Author: WIDYASTUTI, HILDA
Format: Theses
Language: Indonesian
Online Access: https://digilib.itb.ac.id/gdl/view/8143
Institution: Institut Teknologi Bandung
id id-itb.:8143
spelling id-itb.:8143 2017-09-27T15:37:08Z
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description The HMRF-KMeans algorithm satisfies several requirements of document clustering: it handles high-dimensional data, scales well, is accurate, and does not depend on prior domain knowledge. It does not, however, satisfy the requirements for meaningful cluster descriptions and for a data representation that preserves the sequential relationship between words in documents. These requirements can be met by processing documents in a way that preserves their sequential aspect. The HMRF-KMeans algorithm consists of initialization, expectation, and maximization steps. The initialization step selects good initial centroids. The expectation step assigns each data point to the cluster that minimizes the objective function. The maximization step recalculates the centroids and the distance measure parameters, again to minimize the objective function. The expectation and maximization steps are repeated until the convergence condition is met. The development of the HMRF-KMeans algorithm for document clustering in this work includes the use of an n-gram representation, dimension reduction, the cosine distance measure, and cluster labelling by the IGm method. The experimental results show that the difference in clustering accuracy among the 1-gram, 2-gram, 3-gram, and 4-gram representations is not significant, because 1-grams dominate the other n-grams. Clustering with constraints yields more accurate clusters than clustering without constraints, and cluster labels from the 1-gram representation have the best quality, followed by those from the 2-gram, 3-gram, and 4-gram representations. In the HMRF-KMeans algorithm, the 1-gram representation provides the most accurate clusters and the best-quality cluster labels among the n-gram representations.
format Theses
author WIDYASTUTI, HILDA
title A STUDY OF N-GRAM REPRESENTATION IN HMRFKMEANS ALGORITHM FOR DOCUMENT CLUSTERING
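
The pipeline outlined in the description (word n-gram representation, cosine distance, and an alternating expectation/maximization loop with pairwise constraints) can be illustrated with a minimal Python sketch. It is not the thesis implementation. All function names, the `penalty` parameter, and the `must`/`cannot` pair lists are hypothetical; the initial centroids are simply passed in by index; dimension reduction, distance-parameter learning, and IGm labelling are left out.

```python
from collections import Counter
from math import sqrt


def ngrams(text, n):
    """Return the word n-grams of a whitespace-tokenized, lowercased text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def centroid(vectors):
    """Mean of sparse vectors (cosine distance is insensitive to its scale)."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {k: w / len(vectors) for k, w in total.items()}


def assignment_cost(i, c, vec, cents, labels, must, cannot, penalty):
    """Distance to centroid c plus a penalty per violated pairwise constraint."""
    cost = cosine_distance(vec, cents[c])
    for a, b in must:  # must-link: the pair should share a cluster
        if i in (a, b):
            other = b if i == a else a
            if labels[other] is not None and labels[other] != c:
                cost += penalty
    for a, b in cannot:  # cannot-link: the pair should be separated
        if i in (a, b):
            other = b if i == a else a
            if labels[other] == c:
                cost += penalty
    return cost


def cluster(docs, init, must=(), cannot=(), n=1, penalty=1.0, max_iter=20):
    """Simplified constrained k-means loop over bag-of-n-gram vectors.

    init holds the indices of the documents used as initial centroids
    (an assumption; the abstract only says initialization finds good ones).
    """
    vecs = [Counter(ngrams(d, n)) for d in docs]
    cents = [dict(vecs[i]) for i in init]
    labels = [None] * len(docs)
    for _ in range(max_iter):
        new = list(labels)
        # Expectation step: greedily assign each document to the cheapest cluster.
        for i, vec in enumerate(vecs):
            new[i] = min(
                range(len(cents)),
                key=lambda c: assignment_cost(i, c, vec, cents, new, must, cannot, penalty),
            )
        if new == labels:  # converged: no assignment changed
            break
        labels = new
        # Maximization step: recompute each non-empty cluster's centroid.
        for c in range(len(cents)):
            members = [vecs[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                cents[c] = centroid(members)
    return labels


docs = [
    "stock market prices rise",
    "stock market prices fall",
    "football match score",
    "football match result",
]
unconstrained = cluster(docs, init=[0, 2])
constrained = cluster(docs, init=[0, 2], cannot=[(0, 1)], penalty=2.0)
```

Changing `n` from 1 to 2, 3, or 4 switches the representation between the n-gram orders compared in the study. The toy `docs` list is made up for illustration and says nothing about the reported results.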