A STUDY OF N-GRAM REPRESENTATION IN HMRFKMEANS ALGORITHM FOR DOCUMENT CLUSTERING
Main Author: | |
---|---|
Format: | Theses |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/8143 |
Institution: | Institut Teknologi Bandung |
Summary: | The HMRF-KMeans algorithm satisfies several requirements of document clustering: handling high-dimensional data, scalability, accuracy, and independence from prior domain knowledge. It does not satisfy the requirements for meaningful cluster descriptions and for a data representation that preserves the sequential relationship between words in documents. These requirements can be met by processing documents in a way that preserves their sequential aspect. The HMRF-KMeans algorithm consists of initialization, expectation, and maximization steps. The initialization step selects good initial centroids. The expectation step assigns each data point to the cluster that minimizes the objective function. The maximization step recalculates the centroids and the distance measure parameters to minimize the objective function. The expectation and maximization steps are repeated until the convergence condition is met. The development of the HMRF-KMeans algorithm for document clustering includes using an n-gram representation, dimension reduction, the cosine distance measure, and cluster labelling by the IGm method. Experimental results show that the difference in clustering accuracy among the 1-gram, 2-gram, 3-gram, and 4-gram representations is not significant, because 1-grams dominate the other n-grams. Clustering with constraints yields more accurate clusters than clustering without constraints, and cluster labels from the 1-gram representation have the best quality, followed by the 2-gram, 3-gram, and 4-gram representations. In the HMRF-KMeans algorithm, the 1-gram representation provides the most accurate clusters and the best-quality cluster labels among the n-gram representations. |
---|
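
The representation and assignment ideas in the summary can be illustrated with a minimal Python sketch. It is not the thesis implementation, and several details are assumptions made for illustration only: tokens are whitespace-separated lowercase words, an n-gram vector pools all n-grams of length 1 to `max_n`, the constraints are must-link and cannot-link pairs, and each violated constraint adds a constant `penalty` to the cost. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def word_ngrams(text, n):
    """Return the contiguous word n-grams of `text` as tuples."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_vector(text, max_n):
    """Count all word n-grams of length 1..max_n in one sparse vector.

    Pooling the lower orders together is an assumption of this sketch.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(text, n))
    return counts


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def assignment_cost(i, cluster, vectors, centroids, labels,
                    must_link, cannot_link, penalty=1.0):
    """Cost of putting document `i` in `cluster` (simplified expectation step).

    `labels` maps already-assigned document indices to cluster indices.
    The constant `penalty` per violated constraint is an assumption.
    """
    cost = cosine_distance(vectors[i], centroids[cluster])
    for a, b in must_link:
        if i in (a, b):
            other = b if a == i else a
            if other in labels and labels[other] != cluster:
                cost += penalty  # must-link partner sits in another cluster
    for a, b in cannot_link:
        if i in (a, b):
            other = b if a == i else a
            if labels.get(other) == cluster:
                cost += penalty  # cannot-link partner sits in this cluster
    return cost
```

In an expectation step built on this sketch, each document would be assigned to the cluster with the lowest `assignment_cost`. The maximization step, which recomputes the centroids and the distance measure parameters, is not shown.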