Semantic feature reduction and hybrid feature selection for clustering of Arabic Web pages

According to the literature, high-dimensional data reduces the efficiency of clustering algorithms. Clustering Arabic text is challenging because capturing its meaning requires deep semantic processing. To overcome these problems, feature selection and reduction methods have become essential to select a...

Bibliographic Details
Main Author:	Alghamdi, Hanan Musafer H.
Format:	Thesis
Language:	English
Published:	2016
Subjects:	QA75 Electronic computers. Computer science
Online Access:	http://eprints.utm.my/id/eprint/84043/1/HananMusaferPFC2016.pdf http://eprints.utm.my/id/eprint/84043/ http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:125988
Institution:	Universiti Teknologi Malaysia

id	my.utm.84043
record_format	eprints
spelling	my.utm.84043 2019-11-05T04:36:03Z http://eprints.utm.my/id/eprint/84043/ Semantic feature reduction and hybrid feature selection for clustering of Arabic Web pages Alghamdi, Hanan Musafer H. QA75 Electronic computers. Computer science 2016-12 Thesis NonPeerReviewed application/pdf en http://eprints.utm.my/id/eprint/84043/1/HananMusaferPFC2016.pdf Alghamdi, Hanan Musafer H. (2016) Semantic feature reduction and hybrid feature selection for clustering of Arabic Web pages. PhD thesis, Universiti Teknologi Malaysia, Faculty of Computing. http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:125988
institution	Universiti Teknologi Malaysia
building	UTM Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Teknologi Malaysia
content_source	UTM Institutional Repository
url_provider	http://eprints.utm.my/
language	English
topic	QA75 Electronic computers. Computer science
spellingShingle	QA75 Electronic computers. Computer science Alghamdi, Hanan Musafer H. Semantic feature reduction and hybrid feature selection for clustering of Arabic Web pages
description	According to the literature, high-dimensional data reduces the efficiency of clustering algorithms. Clustering Arabic text is challenging because capturing its meaning requires deep semantic processing. To overcome these problems, feature selection and reduction methods have become essential for identifying the appropriate features and reducing the high-dimensional space. There is a need for a suitable design of feature selection and reduction methods that yields a more relevant, meaningful and reduced representation of Arabic texts to ease the clustering process. The research developed three methods for analyzing the features of Arabic Web text. The first method is a hybrid feature selection that selects the informative term representation within Arabic Web pages. It combines three feature selection methods, Chi-square, Mutual Information and Term Frequency–Inverse Document Frequency, into a hybrid model. The second method is a latent document vectorization method that represents documents as probability distributions in the vector space. It addresses the problems of high dimensionality by reducing the dimensional space. To extract the best features, two document vectorizers were implemented: the Bayesian vectorizer and the semantic vectorizer. The third method is an Arabic semantic feature analysis intended to improve the capability of Arabic Web analysis. It ensures a good design for the clustering method to optimize clustering ability when analysing these Web pages, by overcoming the problems of term representation, semantic modeling and dimensional reduction. Experiments were carried out with k-means clustering on two data sets. The methods provided solutions to reduce high-dimensional data and identify the semantic features shared between similar Arabic Web pages grouped in the same cluster. These pages were clustered according to their semantic similarities, and the resulting clusters showed a small Davies–Bouldin index and high accuracy. This study contributes to clustering research by developing three methods to identify the most relevant features of Arabic Web pages.
format	Thesis
author	Alghamdi, Hanan Musafer H.
author_facet	Alghamdi, Hanan Musafer H.
author_sort	Alghamdi, Hanan Musafer H.
title	Semantic feature reduction and hybrid feature selection for clustering of Arabic Web pages
title_short	Semantic feature reduction and hybrid feature selection for clustering of Arabic Web pages
title_full	Semantic feature reduction and hybrid feature selection for clustering of Arabic Web pages
title_fullStr	Semantic feature reduction and hybrid feature selection for clustering of Arabic Web pages
title_full_unstemmed	Semantic feature reduction and hybrid feature selection for clustering of Arabic Web pages
title_sort	semantic feature reduction and hybrid feature selection for clustering of arabic web pages
publishDate	2016
url	http://eprints.utm.my/id/eprint/84043/1/HananMusaferPFC2016.pdf http://eprints.utm.my/id/eprint/84043/ http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:125988
_version_	1651866752164823040
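
Note on the evaluation measure: the description above reports a small Davies–Bouldin index for the resulting clusters. As a reading aid, the following is a minimal sketch of how that index is commonly computed (Euclidean distance, centroid-based scatter). It is not taken from the thesis; the function names and the toy data are illustrative assumptions. Lower values indicate more compact, better separated clusters.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of vectors).

    Assumes at least two non-empty clusters with distinct centroids;
    identical centroids would cause a division by zero.
    """
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    centers = [centroid(c) for c in clusters]
    # Scatter s_i: mean Euclidean distance from each point to its centroid.
    scatter = [
        sum(math.dist(p, c) for p in pts) / len(pts)
        for pts, c in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster j.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Hypothetical toy data: two compact, distant clusters versus two
# spread-out, nearby clusters.
tight = [[[0.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [10.0, 1.0]]]
loose = [[[0.0, 0.0], [0.0, 4.0]], [[3.0, 0.0], [3.0, 4.0]]]

print(davies_bouldin(tight))
print(davies_bouldin(loose))
```

In this toy example the compact, well separated clusters score lower than the spread-out, nearby ones, which is the sense in which a small index indicates good clustering.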
