An intelligent categorization tool for malay research articles

Unlabeled research articles published in the Malay language are becoming increasingly common and available in Malaysia. Thus, the task of manually indexing these research articles is difficult and time-consuming. In order to facilitate research activities that depend on research resources written in l...

Bibliographic Details
Main Authors:	Mohd Norhisham Razali, Rayner Alfred, Chin, Kim On
Format:	Research Report
Language:	English
Published:	Universiti Malaysia Sabah 2015
Subjects:	P Philology. Linguistics
Online Access:	https://eprints.ums.edu.my/id/eprint/24678/1/An%20intelligent%20categorization%20tool%20for%20malay%20research%20articles.pdf https://eprints.ums.edu.my/id/eprint/24678/
Institution:	Universiti Malaysia Sabah

id	my.ums.eprints.24678
record_format	eprints
spelling	my.ums.eprints.24678 2020-01-29T02:41:12Z https://eprints.ums.edu.my/id/eprint/24678/ An intelligent categorization tool for malay research articles Mohd Norhisham Razali Rayner Alfred Chin, Kim On P Philology. Linguistics Universiti Malaysia Sabah 2015 Research Report NonPeerReviewed text en https://eprints.ums.edu.my/id/eprint/24678/1/An%20intelligent%20categorization%20tool%20for%20malay%20research%20articles.pdf Mohd Norhisham Razali and Rayner Alfred and Chin, Kim On (2015) An intelligent categorization tool for malay research articles. (Unpublished)
institution	Universiti Malaysia Sabah
building	UMS Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Malaysia Sabah
content_source	UMS Institutional Repository
url_provider	http://eprints.ums.edu.my/
language	English
topic	P Philology. Linguistics
spellingShingle	P Philology. Linguistics Mohd Norhisham Razali Rayner Alfred Chin, Kim On An intelligent categorization tool for malay research articles
description	Unlabeled research articles published in the Malay language are becoming increasingly common and available in Malaysia, and manually indexing them is difficult and time-consuming. To facilitate research activities that depend on resources written in Malay, these articles must be categorized or indexed efficiently so that appropriate and relevant domains of knowledge can be recommended to researchers in Malaysia. Few studies have addressed the efficient categorization of Malay research articles. The task is more complex than categorizing English research articles because of the complexity of the Malay language, and it therefore represents a major contemporary challenge. Malay text documents are often represented as high-dimensional, sparse vectors that use Malay words as features; such vectors typically have a few thousand dimensions and a sparsity of 95 to 99%. Determining the appropriate number of categories for a large number of Malay documents is also difficult and time-consuming because of this sparsity. If too many categories are assigned, related documents may be split across different clusters; if too few are assigned, unrelated documents may be grouped into the same cluster. This research addresses the improvement of several pre-processing steps that affect clustering performance: stemming, part-of-speech tagging and named-entity recognition. In this work, the effects of improving all of these steps will be investigated. It is anticipated that improving the clustering results will also improve the mapping between the Malay and English clusters obtained from bilingual clustering. Hence, by increasing the mapping percentage for the bilingual clusters, a more robust clustering algorithm can be developed for clustering bilingual documents. In this study, a genetic algorithm (GA) is also proposed to determine the set of terms that can be used to cluster bilingual documents more effectively.
format	Research Report
author	Mohd Norhisham Razali Rayner Alfred Chin, Kim On
author_facet	Mohd Norhisham Razali Rayner Alfred Chin, Kim On
author_sort	Mohd Norhisham Razali
title	An intelligent categorization tool for malay research articles
title_short	An intelligent categorization tool for malay research articles
title_full	An intelligent categorization tool for malay research articles
title_fullStr	An intelligent categorization tool for malay research articles
title_full_unstemmed	An intelligent categorization tool for malay research articles
title_sort	intelligent categorization tool for malay research articles
publisher	Universiti Malaysia Sabah
publishDate	2015
url	https://eprints.ums.edu.my/id/eprint/24678/1/An%20intelligent%20categorization%20tool%20for%20malay%20research%20articles.pdf https://eprints.ums.edu.my/id/eprint/24678/
_version_	1760230268249047040
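
The description above notes that Malay documents represented as word-count vectors are typically 95 to 99% sparse. As a rough illustration of what that figure measures, the following sketch builds a bag-of-words matrix and computes the fraction of zero entries. It is not the report's implementation: the function names `bag_of_words` and `sparsity`, the whitespace tokenizer, and the three-sentence toy corpus are all assumptions made for this example. A corpus this small is far less sparse than the real collections the report describes.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocab, matrix), where matrix[i][j] counts term j in document i.

    Tokenization is a plain lowercase whitespace split (an assumption;
    the report's stemming and tagging steps are not reproduced here).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for term, count in Counter(toks).items():
            row[index[term]] = count
        matrix.append(row)
    return vocab, matrix


def sparsity(matrix):
    """Fraction of zero entries in a document-term matrix."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total if total else 0.0


# Hypothetical toy corpus, invented for illustration (not from the report).
docs = [
    "kajian bahasa melayu",
    "algoritma genetik untuk pengelompokan dokumen",
    "pengelompokan dokumen bahasa melayu",
]

vocab, matrix = bag_of_words(docs)
print(f"terms: {len(vocab)}, sparsity: {sparsity(matrix):.1%}")
```

As more documents are added, the vocabulary grows much faster than the number of distinct words in any single document, which is why sparsity climbs toward the range quoted in the description.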
