A STUDY OF N-GRAM REPRESENTATION IN HMRFKMEANS ALGORITHM FOR DOCUMENT CLUSTERING
The HMRF-KMeans algorithm satisfies several requirements of document clustering: it handles high-dimensional data, is scalable and accurate, and does not depend on prior domain knowledge. It does not, however, satisfy the requirements for meaningful cluster descriptions or for a data representation that preserves...
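As a rough illustration of the n-gram representation named in the title, the sketch below builds word n-gram counts for a document and compares two documents by cosine distance. The function names (`word_ngrams`, `ngram_counts`, `cosine_distance`), the whitespace tokenizer, and the use of raw counts are assumptions made for illustration; the record does not say how the thesis tokenizes or weights terms, nor whether a higher-order representation also includes lower-order n-grams.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of `text`, in order (hypothetical helper)."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n):
    """Sparse count vector: n-gram -> number of occurrences."""
    return Counter(word_ngrams(text, n))


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


doc_a = ngram_counts("document clustering with constraints", 2)
doc_b = ngram_counts("document clustering without constraints", 2)
print(cosine_distance(doc_a, doc_b))
```

Unlike a 1-gram (bag-of-words) vector, a 2-gram vector distinguishes "clustering with" from "clustering without", which is the kind of sequential information the abstract refers to.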
Main Author: | WIDYASTUTI, HILDA |
---|---|
Format: | Theses |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/8143 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:8143 |
---|---|
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
description |
The HMRF-KMeans algorithm satisfies several requirements of document clustering: it handles high-dimensional data, is scalable and accurate, and does not depend on prior domain knowledge. It does not, however, satisfy the requirements for meaningful cluster descriptions or for a data representation that preserves the sequential relationship between words in documents. These requirements can be met by processing documents in a way that preserves their sequential aspect. The HMRF-KMeans algorithm consists of initialization, expectation, and maximization steps. The initialization step selects good initial centroids. The expectation step assigns each data point to the cluster that minimizes the objective function. The maximization step recalculates the centroids and the distance measure parameters to minimize the objective function. The expectation and maximization steps are repeated until a convergence condition is met. Adapting the HMRF-KMeans algorithm for document clustering involves using an n-gram representation, dimension reduction, the cosine distance measure, and cluster labelling by the IGm method. Experimental results show that the difference in clustering accuracy among the 1-gram, 2-gram, 3-gram, and 4-gram representations is not significant, because 1-grams dominate the other n-grams. Clustering with constraints produces more accurate clusters than clustering without constraints, and cluster labels from the 1-gram representation have the best quality, followed by those from the 2-gram, 3-gram, and 4-gram representations. In the HMRF-KMeans algorithm, the 1-gram representation provides the most accurate clusters and the best-quality cluster labels among the n-gram representations. |
author |
WIDYASTUTI, HILDA |
title |
A STUDY OF N-GRAM REPRESENTATION IN HMRFKMEANS ALGORITHM FOR DOCUMENT CLUSTERING |
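The initialization, expectation, and maximization steps described in the abstract can be sketched as a simplified constrained K-means loop. This is a minimal sketch under stated assumptions, not the thesis implementation: the constraints are assumed to be pairwise must-link and cannot-link pairs with a single fixed penalty `weight`, the initial centroids are passed in rather than derived, and the distance-measure parameter update of the maximization step is omitted. The function name `constrained_kmeans` and all parameters are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def constrained_kmeans(points, centroids, must_link=(), cannot_link=(),
                       weight=1.0, max_iter=100):
    """Simplified sketch: distance cost plus a penalty per violated constraint."""
    centroids = [list(c) for c in centroids]
    k = len(centroids)
    # Initial assignment from the given centroids, ignoring constraints.
    labels = [min(range(k), key=lambda h: cosine_distance(p, centroids[h]))
              for p in points]
    for _ in range(max_iter):
        changed = False
        # Expectation step: move each point to its cheapest cluster.
        for i, p in enumerate(points):
            def cost(h):
                c = cosine_distance(p, centroids[h])
                for a, b in must_link:
                    if i in (a, b) and a != b:
                        other = b if a == i else a
                        if labels[other] != h:
                            c += weight  # must-link pair would be split
                for a, b in cannot_link:
                    if i in (a, b) and a != b:
                        other = b if a == i else a
                        if labels[other] == h:
                            c += weight  # cannot-link pair would be merged
                return c
            best = min(range(k), key=cost)
            if best != labels[i]:
                labels[i] = best
                changed = True
        # Maximization step: recompute each centroid as the mean of its members.
        for h in range(k):
            members = [points[i] for i, lab in enumerate(labels) if lab == h]
            if members:
                centroids[h] = [sum(col) / len(members) for col in zip(*members)]
        if not changed:
            break  # convergence: no assignment changed
    return labels, centroids


points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, _ = constrained_kmeans(points, [[1.0, 0.0], [0.0, 1.0]],
                               cannot_link=[(0, 1)], weight=10.0)
print(labels)
```

With a large `weight`, the cannot-link pair is forced into different clusters even though the two points are close; with no constraints the loop reduces to ordinary K-means under cosine distance.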