Language identifications of Arabic script web documents using independent component analysis

We analyze the language identification algorithms used to identify the Arabic script web documents such as Arabic, Jawi, Persian and Urdu using independent component analysis (ICA). We have used a combination of Entropy term weighting scheme and class based feature (CPBF) vectors as feature selectio...

Bibliographic Details
Main Authors: Selamat, Ali, Lee, Zhi-Sam
Format: Book Section
Published: Institute of Electrical and Electronics Engineers 2008
Subjects: QA75 Electronic computers. Computer science
Online Access: http://eprints.utm.my/id/eprint/12612/
http://dx.doi.org/10.1109/AMS.2008.46
Institution: Universiti Teknologi Malaysia
id my.utm.12612
record_format eprints
spelling my.utm.12612 2011-06-14T05:11:15Z http://eprints.utm.my/id/eprint/12612/ Language identifications of Arabic script web documents using independent component analysis Selamat, Ali Lee, Zhi-Sam QA75 Electronic computers. Computer science Institute of Electrical and Electronics Engineers 2008 Book Section PeerReviewed Selamat, Ali and Lee, Zhi-Sam (2008) Language identifications of Arabic script web documents using independent component analysis. In: Proceedings - 2nd Asia International Conference on Modelling and Simulation, AMS 2008. Institute of Electrical and Electronics Engineers, New York, 427-432. ISBN 978-076953136-6 http://dx.doi.org/10.1109/AMS.2008.46 doi:10.1109/AMS.2008.46
institution Universiti Teknologi Malaysia
building UTM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Teknologi Malaysia
content_source UTM Institutional Repository
url_provider http://eprints.utm.my/
topic QA75 Electronic computers. Computer science
spellingShingle QA75 Electronic computers. Computer science
Selamat, Ali
Lee, Zhi-Sam
Language identifications of Arabic script web documents using independent component analysis
description We analyze language identification algorithms for Arabic script web documents written in Arabic, Jawi, Persian, and Urdu, using independent component analysis (ICA). We combine an entropy term weighting scheme with class based feature (CPBF) vectors as feature selection methods to select the best features of Arabic script web documents for web page language identification. The selected features are then input to the identification of latent semantics of user profiles using singular value decomposition (SVD). The SVD is used to remove noise from the retrieved documents before ICA is applied for topic extraction. We assume that the topic of each document is independent of the others. We use the information retrieval measures precision, recall, and F1 to evaluate the effectiveness of the proposed algorithm. The experiments show that the proposed method can lead to good Arabic script language identification results, with good separation of the Arabic, Persian, and Urdu languages using ICA.
format Book Section
author Selamat, Ali
Lee, Zhi-Sam
author_facet Selamat, Ali
Lee, Zhi-Sam
author_sort Selamat, Ali
title Language identifications of Arabic script web documents using independent component analysis
title_short Language identifications of Arabic script web documents using independent component analysis
title_full Language identifications of Arabic script web documents using independent component analysis
title_fullStr Language identifications of Arabic script web documents using independent component analysis
title_full_unstemmed Language identifications of Arabic script web documents using independent component analysis
title_sort language identifications of arabic script web documents using independent component analysis
publisher Institute of Electrical and Electronics Engineers
publishDate 2008
url http://eprints.utm.my/id/eprint/12612/
http://dx.doi.org/10.1109/AMS.2008.46
_version_ 1643645998079148032
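
The description field in this record mentions entropy term weighting and evaluation with precision, recall, and F1. As a rough illustration only, the sketch below shows one common form of each. It is not the authors' implementation: the log-entropy formula is assumed to be the variant meant by "Entropy term weighting", the function names log_entropy_weights and precision_recall_f1 are hypothetical, and the tokens and labels are toy placeholders. The CPBF, SVD, and ICA steps are not shown.

```python
import math
from collections import Counter


def log_entropy_weights(docs):
    """Weight terms with the common log-entropy scheme (an assumed variant).

    docs is a list of token lists. Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]

    # Global frequency of each term over the whole collection.
    global_freq = Counter()
    for c in counts:
        global_freq.update(c)

    # Global weight: 1 + sum_j p_ij * log(p_ij) / log(N), with p_ij = tf_ij / gf_i.
    # A term spread evenly over all documents gets weight 0;
    # a term confined to a single document gets weight 1.
    global_weight = {}
    for term, gf in global_freq.items():
        if n_docs < 2:
            global_weight[term] = 1.0  # entropy is undefined for a single document
            continue
        entropy_sum = 0.0
        for c in counts:
            tf = c.get(term, 0)
            if tf:
                p = tf / gf
                entropy_sum += p * math.log(p)
        global_weight[term] = 1.0 + entropy_sum / math.log(n_docs)

    # Local weight: log(1 + tf_ij), scaled by the global weight.
    return [
        {term: math.log(1 + tf) * global_weight[term] for term, tf in c.items()}
        for c in counts
    ]


def precision_recall_f1(true_labels, predicted_labels, target):
    """Precision, recall, and F1 for one target class (for example one language)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy placeholder tokens, not real data from the paper.
    toy_docs = [["t1", "t2", "t2"], ["t1", "t3"], ["t1", "t3", "t4"]]
    for weights in log_entropy_weights(toy_docs):
        print(weights)

    # Toy labels using language codes purely for illustration.
    truth = ["ar", "ar", "fa", "ur"]
    guess = ["ar", "fa", "fa", "ur"]
    print(precision_recall_f1(truth, guess, "ar"))
```

In a pipeline like the one the description outlines, the weighted term vectors would be the input to the later feature selection, SVD, and ICA stages, and the per-language scores would be computed on the final predicted labels.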