Language identifications of Arabic script web documents using independent component analysis

We analyze language identification algorithms for Arabic script web documents in languages such as Arabic, Jawi, Persian and Urdu, using independent component analysis (ICA). We use a combination of the entropy term weighting scheme and class based feature (CPBF) vectors as feature selectio...

Bibliographic Details
Main Authors:	Selamat, Ali, Lee, Zhi-Sam
Format:	Book Section
Published:	Institute of Electrical and Electronics Engineers 2008
Subjects:	QA75 Electronic computers. Computer science
Online Access:	http://eprints.utm.my/id/eprint/12612/ http://dx.doi.org/10.1109/AMS.2008.46
Institution:	Universiti Teknologi Malaysia

id	my.utm.12612
record_format	eprints
spelling	my.utm.12612 2011-06-14T05:11:15Z http://eprints.utm.my/id/eprint/12612/ Language identifications of Arabic script web documents using independent component analysis Selamat, Ali Lee, Zhi-Sam QA75 Electronic computers. Computer science We analyze language identification algorithms for Arabic script web documents in languages such as Arabic, Jawi, Persian and Urdu, using independent component analysis (ICA). We use a combination of the entropy term weighting scheme and class based feature (CPBF) vectors as feature selection methods for selecting the best features of Arabic script web documents for web page language identification. We then input the selected features based on the identification of latent semantics of user profiles using singular value decomposition (SVD). SVD is used to remove noise from the retrieved documents before ICA is applied for topic extraction. We assume that the topics of the documents are independent of one another. We use the information retrieval measures precision, recall and F1 to evaluate the effectiveness of the proposed algorithm. The experiments show that the proposed method can lead to good Arabic script language identification results, with good separation of the Arabic, Persian, and Urdu languages using ICA. Institute of Electrical and Electronics Engineers 2008 Book Section PeerReviewed Selamat, Ali and Lee, Zhi-Sam (2008) Language identifications of Arabic script web documents using independent component analysis. In: Proceedings - 2nd Asia International Conference on Modelling and Simulation, AMS 2008. Institute of Electrical and Electronics Engineers, New York, 427-432. ISBN 978-076953136-6 http://dx.doi.org/10.1109/AMS.2008.46 doi:10.1109/AMS.2008.46
institution	Universiti Teknologi Malaysia
building	UTM Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Teknologi Malaysia
content_source	UTM Institutional Repository
url_provider	http://eprints.utm.my/
topic	QA75 Electronic computers. Computer science
spellingShingle	QA75 Electronic computers. Computer science Selamat, Ali Lee, Zhi-Sam Language identifications of Arabic script web documents using independent component analysis
description	We analyze language identification algorithms for Arabic script web documents in languages such as Arabic, Jawi, Persian and Urdu, using independent component analysis (ICA). We use a combination of the entropy term weighting scheme and class based feature (CPBF) vectors as feature selection methods for selecting the best features of Arabic script web documents for web page language identification. We then input the selected features based on the identification of latent semantics of user profiles using singular value decomposition (SVD). SVD is used to remove noise from the retrieved documents before ICA is applied for topic extraction. We assume that the topics of the documents are independent of one another. We use the information retrieval measures precision, recall and F1 to evaluate the effectiveness of the proposed algorithm. The experiments show that the proposed method can lead to good Arabic script language identification results, with good separation of the Arabic, Persian, and Urdu languages using ICA.
format	Book Section
author	Selamat, Ali Lee, Zhi-Sam
author_facet	Selamat, Ali Lee, Zhi-Sam
author_sort	Selamat, Ali
title	Language identifications of Arabic script web documents using independent component analysis
title_short	Language identifications of Arabic script web documents using independent component analysis
title_full	Language identifications of Arabic script web documents using independent component analysis
title_fullStr	Language identifications of Arabic script web documents using independent component analysis
title_full_unstemmed	Language identifications of Arabic script web documents using independent component analysis
title_sort	language identifications of arabic script web documents using independent component analysis
publisher	Institute of Electrical and Electronics Engineers
publishDate	2008
url	http://eprints.utm.my/id/eprint/12612/ http://dx.doi.org/10.1109/AMS.2008.46
_version_	1643645998079148032

Similar Items