The hybrid feature selection k-means method for Arabic webpage classification

The high dimensionality of the features found in the enormous amount of Arabic text available on the Internet is an important research problem in Web information retrieval. It reduces the accuracy of clustering algorithms and increases processing time. Selecting the relevant features is the best...

Bibliographic Details
Main Authors: Alghamdi, Hanan, Selamat, Ali
Format: Article
Published: Penerbit UTM Press 2014
Subjects: QA75 Electronic computers. Computer science
Online Access: http://eprints.utm.my/id/eprint/62935/
http://dx.doi.org/10.11113/jt.v70.3518
Institution: Universiti Teknologi Malaysia
id my.utm.62935
record_format eprints
spelling my.utm.62935 2017-11-01T04:17:08Z Alghamdi, Hanan and Selamat, Ali (2014) The hybrid feature selection k-means method for Arabic webpage classification. Jurnal Teknologi, 70 (5). pp. 73-79. ISSN 0127-9696 (PeerReviewed) DOI:10.11113/jt.v70.3518
institution Universiti Teknologi Malaysia
building UTM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknologi Malaysia
content_source UTM Institutional Repository
url_provider http://eprints.utm.my/
topic QA75 Electronic computers. Computer science
description The high dimensionality of the features found in the enormous amount of Arabic text available on the Internet is an important research problem in Web information retrieval. It reduces the accuracy of clustering algorithms and increases processing time. Selecting the relevant features is the best solution. Therefore, in this paper, we propose a hybrid feature selection model (Hybrid-FS) for k-means clustering that combines three feature selection methods: chi-squared, mutual information, and term frequency-inverse document frequency. This model represents text data in a high-level structure consisting of three types of objects: terms, documents, and categories. We evaluate the model on a set of common Arabic online newspapers and assess the effect of using Hybrid-FS with standard k-means clustering. The experimental results show that the proposed method increases purity by 28% and lowers the runtime by 80% compared to the standard k-means algorithm. We conclude that the proposed hybrid feature selection model enhances the accuracy of the k-means algorithm and produces coherent, compact, well-separated clusters when applied to high-dimensional datasets.
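The description names three scoring methods and the purity measure but does not say how the three scores are combined. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes each term's chi-squared, mutual information, and TF-IDF scores are min-max normalized and averaged, and that the top-ranked terms are kept. The function names (`term_scores`, `hybrid_select`, `purity`) and the English stand-in tokens are hypothetical.

```python
import math
from collections import Counter


def term_scores(docs, labels):
    """Return {term: (chi2, mi, tfidf)} for tokenized docs with category labels."""
    n = len(docs)
    cats = sorted(set(labels))
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = {}
    for t in vocab:
        chi2 = mi = 0.0  # maxima over categories, floored at 0
        for cat in cats:
            # 2x2 contingency table for term t and category cat
            a = sum(1 for d, l in zip(docs, labels) if t in d and l == cat)
            b = df[t] - a                # term present, other categories
            c = labels.count(cat) - a    # term absent, this category
            e = n - a - b - c            # term absent, other categories
            denom = (a + b) * (c + e) * (a + c) * (b + e)
            if denom:
                chi2 = max(chi2, n * (a * e - b * c) ** 2 / denom)
            if a:
                # pointwise mutual information log(P(t,cat) / (P(t) P(cat)))
                mi = max(mi, math.log(a * n / ((a + b) * (a + c))))
        tf = sum(d.count(t) for d in docs)
        scores[t] = (chi2, mi, tf * math.log(n / df[t]))
    return scores


def hybrid_select(docs, labels, top_n):
    """ASSUMPTION: combine the three scores by averaging min-max normalized values."""
    scores = term_scores(docs, labels)
    terms = list(scores)
    combined = dict.fromkeys(terms, 0.0)
    for i in range(3):
        col = [scores[t][i] for t in terms]
        lo, hi = min(col), max(col)
        for t, v in zip(terms, col):
            combined[t] += (v - lo) / (hi - lo) / 3 if hi > lo else 0.0
    # ties are broken alphabetically so the result is deterministic
    return sorted(terms, key=lambda t: (-combined[t], t))[:top_n]


def purity(clusters, labels):
    """Fraction of documents that belong to the majority category of their cluster."""
    by_cluster = {}
    for k, l in zip(clusters, labels):
        by_cluster.setdefault(k, Counter())[l] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(labels)


# Hypothetical toy corpus; real input would be tokenized Arabic news pages.
docs = [
    ["match", "goal", "team", "news"],
    ["goal", "team", "win", "news"],
    ["market", "bank", "price", "news"],
    ["bank", "market", "trade", "news"],
]
labels = ["sport", "sport", "economy", "economy"]
selected = hybrid_select(docs, labels, top_n=4)
print(selected)
print(purity([0, 0, 1, 1], labels))
```

The selected terms would then define the reduced vector space on which k-means runs; a term such as "news" that occurs in every document scores zero on all three measures and is dropped.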
format Article
author Alghamdi, Hanan
Selamat, Ali
title The hybrid feature selection k-means method for Arabic webpage classification
publisher Penerbit UTM Press
publishDate 2014
url http://eprints.utm.my/id/eprint/62935/
http://dx.doi.org/10.11113/jt.v70.3518