Language identifications of Arabic script web documents using independent component analysis
We analyze language identification algorithms for Arabic-script web documents written in languages such as Arabic, Jawi, Persian, and Urdu, using independent component analysis (ICA). We use a combination of an entropy term-weighting scheme and class based feature (CPBF) vectors as feature selectio...
Main Authors: | Selamat, Ali; Lee, Zhi-Sam |
---|---|
Format: | Book Section |
Published: | Institute of Electrical and Electronics Engineers, 2008 |
Subjects: | QA75 Electronic computers. Computer science |
Online Access: | http://eprints.utm.my/id/eprint/12612/ http://dx.doi.org/10.1109/AMS.2008.46 |
Institution: | Universiti Teknologi Malaysia |
id |
my.utm.12612 |
---|---|
record_format |
eprints |
spelling |
my.utm.12612 2011-06-14T05:11:15Z http://eprints.utm.my/id/eprint/12612/ Language identifications of Arabic script web documents using independent component analysis Selamat, Ali Lee, Zhi-Sam QA75 Electronic computers. Computer science We analyze language identification algorithms for Arabic-script web documents written in languages such as Arabic, Jawi, Persian, and Urdu, using independent component analysis (ICA). We use a combination of an entropy term-weighting scheme and class based feature (CPBF) vectors as feature selection methods to select the best features of Arabic-script web documents for web page language identification. We then input the selected features, based on the identification of latent semantics of user profiles, to singular value decomposition (SVD). The SVD is used to remove noise from the retrieved documents before ICA is applied for topic extraction. We assume that the topic of each document is independent of the others. We use the information retrieval measures precision, recall, and F1 to evaluate the effectiveness of the proposed algorithm. The experiments show that the proposed method can lead to good Arabic-script language identification results, with good separation of the Arabic, Persian, and Urdu languages using ICA. Institute of Electrical and Electronics Engineers 2008 Book Section PeerReviewed Selamat, Ali and Lee, Zhi-Sam (2008) Language identifications of Arabic script web documents using independent component analysis. In: Proceedings - 2nd Asia International Conference on Modelling and Simulation, AMS 2008. Institute of Electrical and Electronics Engineers, New York, pp. 427-432. ISBN 978-076953136-6 http://dx.doi.org/10.1109/AMS.2008.46 doi:10.1109/AMS.2008.46 |
institution |
Universiti Teknologi Malaysia |
building |
UTM Library |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Teknologi Malaysia |
content_source |
UTM Institutional Repository |
url_provider |
http://eprints.utm.my/ |
topic |
QA75 Electronic computers. Computer science |
spellingShingle |
QA75 Electronic computers. Computer science Selamat, Ali Lee, Zhi-Sam Language identifications of Arabic script web documents using independent component analysis |
description |
We analyze language identification algorithms for Arabic-script web documents written in languages such as Arabic, Jawi, Persian, and Urdu, using independent component analysis (ICA). We use a combination of an entropy term-weighting scheme and class based feature (CPBF) vectors as feature selection methods to select the best features of Arabic-script web documents for web page language identification. We then input the selected features, based on the identification of latent semantics of user profiles, to singular value decomposition (SVD). The SVD is used to remove noise from the retrieved documents before ICA is applied for topic extraction. We assume that the topic of each document is independent of the others. We use the information retrieval measures precision, recall, and F1 to evaluate the effectiveness of the proposed algorithm. The experiments show that the proposed method can lead to good Arabic-script language identification results, with good separation of the Arabic, Persian, and Urdu languages using ICA. |
format |
Book Section |
author |
Selamat, Ali Lee, Zhi-Sam |
author_facet |
Selamat, Ali Lee, Zhi-Sam |
author_sort |
Selamat, Ali |
title |
Language identifications of Arabic script web documents using independent component analysis |
title_short |
Language identifications of Arabic script web documents using independent component analysis |
title_full |
Language identifications of Arabic script web documents using independent component analysis |
title_fullStr |
Language identifications of Arabic script web documents using independent component analysis |
title_full_unstemmed |
Language identifications of Arabic script web documents using independent component analysis |
title_sort |
language identifications of arabic script web documents using independent component analysis |
publisher |
Institute of Electrical and Electronics Engineers |
publishDate |
2008 |
url |
http://eprints.utm.my/id/eprint/12612/ http://dx.doi.org/10.1109/AMS.2008.46 |
_version_ |
1643645998079148032 |
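
The description in this record mentions an entropy term-weighting scheme and evaluation by precision, recall, and F1. As a rough illustration only, the sketch below shows one common form of each. The record does not give the paper's exact formulas, so the log-entropy variant, the function names `entropy_weights` and `precision_recall_f1`, and the toy data are all assumptions made for this example, not the authors' implementation. The CPBF, SVD, and ICA stages are not covered.

```python
import math
from collections import Counter


def entropy_weights(docs):
    """Global log-entropy weight per term (an assumed, commonly used variant).

    docs is a list of {term: count} dictionaries, one per document.
    weight(t) = 1 + sum_j p_tj * log(p_tj) / log(n_docs),
    where p_tj is the share of term t's total count found in document j.
    A term spread evenly over all documents gets a weight near 0;
    a term concentrated in a single document gets a weight of 1.
    """
    n_docs = len(docs)
    totals = Counter()
    for doc in docs:
        totals.update(doc)  # adds the per-document counts to the totals

    weights = {}
    for term, total in totals.items():
        if n_docs < 2:
            weights[term] = 1.0  # entropy is undefined for a single document
            continue
        acc = 0.0
        for doc in docs:
            count = doc.get(term, 0)
            if count:
                p = count / total
                acc += p * math.log(p)
        weights[term] = 1.0 + acc / math.log(n_docs)
    return weights


def precision_recall_f1(true_labels, predicted_labels, language):
    """Precision, recall, and F1 for one language, treated as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == language and p == language)
    fp = sum(1 for t, p in pairs if t != language and p == language)
    fn = sum(1 for t, p in pairs if t == language and p != language)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical toy corpus: two documents with made-up term counts.
    toy_docs = [{"the": 1, "ke": 2}, {"the": 1}]
    print(entropy_weights(toy_docs))

    # Hypothetical gold and predicted language labels for four documents.
    gold = ["ar", "ar", "fa", "ur"]
    pred = ["ar", "fa", "fa", "ur"]
    for lang in ("ar", "fa", "ur"):
        print(lang, precision_recall_f1(gold, pred, lang))
```

Scoring each language separately, as done here, makes it possible to see which of the languages a classifier confuses with one another.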