A method for k-means-like clustering of categorical data

© 2019, Springer-Verlag GmbH Germany, part of Springer Nature. Despite recent efforts, the challenge in clustering categorical and mixed data in the context of big data still remains due to the lack of inherently meaningful measure of similarity between categorical objects and the high computational...

Bibliographic Details
Main Authors:	Thu Hien Thi Nguyen, Duy Tai Dinh, Songsak Sriboonchitta, Van Nam Huynh
Format:	Journal
Published:	2020
Subjects:	Computer Science
Online Access:	https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85073982951&origin=inward http://cmuir.cmu.ac.th/jspui/handle/6653943832/67757
Institution:	Chiang Mai University

id	th-cmuir.6653943832-67757
record_format	dspace
spelling	th-cmuir.6653943832-67757 2020-04-02T15:02:51Z 2020-04-02T15:02:51Z 2020-04-02T15:02:51Z 2019-01-01 Journal 18685145 18685137 2-s2.0-85073982951 10.1007/s12652-019-01445-5
institution	Chiang Mai University
building	Chiang Mai University Library
country	Thailand
collection	CMU Intellectual Repository
topic	Computer Science
description	© 2019, Springer-Verlag GmbH Germany, part of Springer Nature. Despite recent efforts, the challenge in clustering categorical and mixed data in the context of big data still remains due to the lack of inherently meaningful measure of similarity between categorical objects and the high computational complexity of existing clustering techniques. While k-means method is well known for its efficiency in clustering large data sets, working only on numerical data prohibits it from being applied for clustering categorical data. In this paper, we aim to develop a novel extension of k-means method for clustering categorical data, making use of an information theoretic-based dissimilarity measure and a kernel-based method for representation of cluster means for categorical objects. Such an approach allows us to formulate the problem of clustering categorical data in the fashion similar to k-means clustering, while a kernel-based definition of centers also provides an interpretation of cluster means being consistent with the statistical interpretation of the cluster means for numerical data. In order to demonstrate the performance of the new clustering method, a series of experiments on real datasets from UCI Machine Learning Repository are conducted and the obtained results are compared with several previously developed algorithms for clustering categorical data.
format	Journal
author	Thu Hien Thi Nguyen Duy Tai Dinh Songsak Sriboonchitta Van Nam Huynh
author_sort	Thu Hien Thi Nguyen
title	A method for k-means-like clustering of categorical data
title_sort	method for k-means-like clustering of categorical data
publishDate	2020
url	https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85073982951&origin=inward http://cmuir.cmu.ac.th/jspui/handle/6653943832/67757
_version_	1681426694178603008

Similar Items