A STUDY OF N-GRAM REPRESENTATION IN HMRFKMEANS ALGORITHM FOR DOCUMENT CLUSTERING

The HMRF-KMeans algorithm satisfies several requirements of document clustering, including handling high-dimensional data, scalability, accuracy, and independence from prior domain knowledge. The algorithm does not satisfy the requirements for meaningful cluster descriptions and for a data representation that preserves...

Bibliographic Details
Main Author:	WIDYASTUTI, HILDA
Format:	Theses
Language:	Indonesian
Online Access:	https://digilib.itb.ac.id/gdl/view/8143
Institution:	Institut Teknologi Bandung

id	id-itb.:8143
spelling	id-itb.:8143 2017-09-27T15:37:08Z
institution	Institut Teknologi Bandung
building	Institut Teknologi Bandung Library
continent	Asia
country	Indonesia
content_provider	Institut Teknologi Bandung
collection	Digital ITB
language	Indonesia
description	The HMRF-KMeans algorithm satisfies several requirements of document clustering, including handling high-dimensional data, scalability, accuracy, and independence from prior domain knowledge. The algorithm does not satisfy the requirements for meaningful cluster descriptions and for a data representation that preserves the sequential relationship between words in documents. These requirements can be satisfied by processing documents in a way that preserves their sequential aspect. The HMRF-KMeans algorithm consists of an initialization step, an expectation step, and a maximization step. The initialization step selects good initial centroids. The expectation step assigns each data point to the cluster that minimizes the objective function. The maximization step recalculates the centroids and the distance measure parameters to minimize the objective function. The expectation and maximization steps are repeated until the convergence condition is met. The development of the HMRF-KMeans algorithm for document clustering includes using an n-gram representation, dimension reduction, the cosine distance measure, and cluster labelling by the IGm method. Experimental results show that the differences in clustering accuracy among the 1-gram, 2-gram, 3-gram, and 4-gram representations are not significant, because 1-grams dominate the other n-grams. Clustering with constraints produces more accurate clusters than clustering without constraints, and cluster labels from the 1-gram representation have the best quality, followed by those from the 2-gram, 3-gram, and 4-gram representations. In the HMRF-KMeans algorithm, the 1-gram representation yields the most accurate clusters and the best-quality cluster labels among the n-gram representations tested.
format	Theses
author	WIDYASTUTI, HILDA
title	A STUDY OF N-GRAM REPRESENTATION IN HMRFKMEANS ALGORITHM FOR DOCUMENT CLUSTERING
url	https://digilib.itb.ac.id/gdl/view/8143
_version_	1820664338292146176
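
The steps named in the description (word n-gram representation, cosine distance, an expectation step that assigns documents to clusters, and a maximization step that recalculates centroids) can be illustrated with a minimal Python sketch. It is not the thesis implementation. Several simplifications are assumptions made here for illustration only: the initial centroids are taken from caller-supplied seed documents rather than from a dedicated initialization step, constraint violations add a fixed `penalty` to the cost, the distance measure parameters are not re-estimated, and dimension reduction and IGm labelling are omitted. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a text, in reading order."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def partner(pair, i):
    """Return the other member of a constraint pair, or None if i is not in it."""
    a, b = pair
    return b if a == i else a if b == i else None


def constrained_kmeans(vectors, seeds, must_link=(), cannot_link=(),
                       penalty=1.0, max_iter=20):
    """Simplified constrained k-means (assumption: fixed violation penalty)."""
    centroids = [dict(vectors[i]) for i in seeds]  # assumed seed-based init
    labels = [None] * len(vectors)
    for _ in range(max_iter):
        changed = False
        # Expectation step: greedily give each document the cheapest cluster,
        # where cost = cosine distance + penalties for violated constraints.
        for i, vec in enumerate(vectors):
            best, best_cost = None, float("inf")
            for c, cen in enumerate(centroids):
                cost = cosine_distance(vec, cen)
                for pair in must_link:
                    j = partner(pair, i)
                    if j is not None and labels[j] is not None and labels[j] != c:
                        cost += penalty
                for pair in cannot_link:
                    j = partner(pair, i)
                    if j is not None and labels[j] == c:
                        cost += penalty
                if cost < best_cost:
                    best, best_cost = c, cost
            if labels[i] != best:
                labels[i], changed = best, True
        if not changed:
            break  # convergence: no assignment changed
        # Maximization step: recompute each centroid as the summed member
        # vector (same direction as the mean, which is all cosine needs).
        for c in range(len(centroids)):
            total = Counter()
            for i, vec in enumerate(vectors):
                if labels[i] == c:
                    total.update(vec)
            if total:
                centroids[c] = dict(total)
    return labels


docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock prices fell sharply today",
    "stock prices rose sharply today",
]
# 2-gram count vectors; change n to compare 1-gram to 4-gram representations.
bigram_vectors = [Counter(word_ngrams(d, 2)) for d in docs]
labels = constrained_kmeans(bigram_vectors, seeds=[0, 2])
labels_cl = constrained_kmeans(bigram_vectors, seeds=[0, 2], cannot_link=[(0, 1)])
```

In this toy run the unconstrained call groups the two "cat" documents together and the two "stock" documents together, while the cannot-link pair `(0, 1)` forces the second document out of the first document's cluster. The toy data says nothing about the accuracy comparison reported in the description.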

Similar Items