The hybrid feature selection k-means method for Arabic webpage classification

The high dimensionality of the features in the enormous amount of Arabic text available on the Internet is an important research problem in Web information retrieval. It reduces the accuracy of clustering algorithms and increases the processing time. Selecting the relevant features is the best...

Bibliographic Details
Main Authors:	Alghamdi, Hanan; Selamat, Ali
Format:	Article
Published:	Penerbit UTM Press 2014
Subjects:	QA75 Electronic computers. Computer science
Online Access:	http://eprints.utm.my/id/eprint/62935/ http://dx.doi.org/10.11113/jt.v70.3518
Institution:	Universiti Teknologi Malaysia

id	my.utm.62935
record_format	eprints
spelling	my.utm.62935 2017-11-01T04:17:08Z http://eprints.utm.my/id/eprint/62935/ The hybrid feature selection k-means method for Arabic webpage classification. Alghamdi, Hanan; Selamat, Ali. QA75 Electronic computers. Computer science. Penerbit UTM Press, 2014. Article. PeerReviewed. Alghamdi, Hanan and Selamat, Ali (2014) The hybrid feature selection k-means method for Arabic webpage classification. Jurnal Teknologi, 70 (5). pp. 73-79. ISSN 0127-9696. http://dx.doi.org/10.11113/jt.v70.3518 DOI:10.11113/jt.v70.3518
institution	Universiti Teknologi Malaysia
building	UTM Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Teknologi Malaysia
content_source	UTM Institutional Repository
url_provider	http://eprints.utm.my/
topic	QA75 Electronic computers. Computer science
description	The high dimensionality of the features in the enormous amount of Arabic text available on the Internet is an important research problem in Web information retrieval. It reduces the accuracy of clustering algorithms and increases the processing time. Selecting the relevant features is the best solution. Therefore, in this paper, we propose a feature selection model that combines three feature selection methods (chi-squared, mutual information, and term frequency-inverse document frequency) into a hybrid feature selection model (Hybrid-FS) for k-means clustering. This model represents text data in a high structure consisting of three types of objects: terms, documents, and categories. We evaluate the model on a set of common Arabic online newspapers and assess the effect of using Hybrid-FS with standard k-means clustering. The experimental results show that the proposed method increases purity by 28% and lowers the runtime by 80% compared to the standard k-means algorithm. We conclude that the proposed hybrid feature selection model enhances the accuracy of the k-means algorithm and produces coherent, compact, well-separated clusters when applied to high-dimensional datasets.
format	Article
author	Alghamdi, Hanan; Selamat, Ali
author_facet	Alghamdi, Hanan; Selamat, Ali
author_sort	Alghamdi, Hanan
title	The hybrid feature selection k-means method for Arabic webpage classification
title_sort	hybrid feature selection k-means method for arabic webpage classification
publisher	Penerbit UTM Press
publishDate	2014
url	http://eprints.utm.my/id/eprint/62935/ http://dx.doi.org/10.11113/jt.v70.3518
_version_	1643655568367288320
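
The description above reports clustering quality as purity. As a minimal sketch of how that metric is commonly computed (not the authors' code; the function name, cluster assignments, and category labels below are hypothetical), each cluster is credited with the size of its majority category, and the credits are summed and divided by the number of documents:

```python
from collections import Counter


def purity(cluster_ids, true_labels):
    """Return the fraction of documents that belong to the majority
    category of the cluster they were assigned to."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one document is required")

    # Count how many documents of each category land in each cluster.
    counts_by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1

    # Each cluster contributes the size of its largest category.
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six documents assigned to two clusters.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["sports", "sports", "politics", "politics", "politics", "economy"]
score = purity(clusters, labels)
```

Purity ranges from just above 0 to 1, and it rises trivially as the number of clusters grows, so comparisons are only meaningful at the same k.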

Similar Items