An intelligent categorization tool for malay research articles

Unlabeled research articles published in the Malay language are becoming increasingly common and available in Malaysia. Thus, the task of manually indexing these research articles is difficult and time-consuming. In order to facilitate research activities that depend on research resources written in l...

Bibliographic Details
Main Authors: Mohd Norhisham Razali, Rayner Alfred, Chin, Kim On
Format: Research Report
Language: English
Published: Universiti Malaysia Sabah 2015
Subjects: P Philology. Linguistics
Online Access: https://eprints.ums.edu.my/id/eprint/24678/1/An%20intelligent%20categorization%20tool%20for%20malay%20research%20articles.pdf
https://eprints.ums.edu.my/id/eprint/24678/
Institution: Universiti Malaysia Sabah
id my.ums.eprints.24678
record_format eprints
spelling my.ums.eprints.24678 2020-01-29T02:41:12Z https://eprints.ums.edu.my/id/eprint/24678/ An intelligent categorization tool for malay research articles Mohd Norhisham Razali Rayner Alfred Chin, Kim On P Philology. Linguistics Unlabeled research articles published in the Malay language are becoming increasingly common and available in Malaysia, and manually indexing them is difficult and time-consuming. To facilitate research activities that depend on resources written in Malay, these articles must be categorized or indexed efficiently so that appropriate and relevant domains of knowledge can be recommended to researchers in Malaysia. Little research has been conducted on categorizing Malay research articles efficiently. Categorizing Malay research articles is more complex than categorizing English research articles because of the complexity of the Malay language, and it therefore represents a major contemporary challenge. Malay text documents are often represented as high-dimensional, sparse vectors that use Malay words as features; a few thousand dimensions and a sparsity of 95 to 99% are typical. Determining the appropriate number of categories for a large collection of Malay documents is also difficult and time-consuming because of this sparsity. If too many categories are assigned, related documents may be split across different clusters; if too few are assigned, unrelated documents may be grouped into the same cluster. This research addresses the improvement of several pre-processing steps that affect clustering performance: stemming, part-of-speech tagging and named-entity recognition. In this work, the effects of improving all of these pre-processing steps will be investigated. It is anticipated that improving the clustering results will also improve the mapping between the Malay and English clusters obtained from bilingual clustering. Hence, by increasing the mapping percentage for the bilingual clusters, a more robust clustering algorithm can be developed for clustering bilingual documents. In this study, a genetic algorithm (GA) is also proposed to determine the set of terms that can be used to cluster bilingual documents more effectively. Universiti Malaysia Sabah 2015 Research Report NonPeerReviewed text en https://eprints.ums.edu.my/id/eprint/24678/1/An%20intelligent%20categorization%20tool%20for%20malay%20research%20articles.pdf Mohd Norhisham Razali and Rayner Alfred and Chin, Kim On (2015) An intelligent categorization tool for malay research articles. (Unpublished)
institution Universiti Malaysia Sabah
building UMS Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Malaysia Sabah
content_source UMS Institutional Repository
url_provider http://eprints.ums.edu.my/
language English
topic P Philology. Linguistics
spellingShingle P Philology. Linguistics
Mohd Norhisham Razali
Rayner Alfred
Chin, Kim On
An intelligent categorization tool for malay research articles
description Unlabeled research articles published in the Malay language are becoming increasingly common and available in Malaysia, and manually indexing them is difficult and time-consuming. To facilitate research activities that depend on resources written in Malay, these articles must be categorized or indexed efficiently so that appropriate and relevant domains of knowledge can be recommended to researchers in Malaysia. Little research has been conducted on categorizing Malay research articles efficiently. Categorizing Malay research articles is more complex than categorizing English research articles because of the complexity of the Malay language, and it therefore represents a major contemporary challenge. Malay text documents are often represented as high-dimensional, sparse vectors that use Malay words as features; a few thousand dimensions and a sparsity of 95 to 99% are typical. Determining the appropriate number of categories for a large collection of Malay documents is also difficult and time-consuming because of this sparsity. If too many categories are assigned, related documents may be split across different clusters; if too few are assigned, unrelated documents may be grouped into the same cluster. This research addresses the improvement of several pre-processing steps that affect clustering performance: stemming, part-of-speech tagging and named-entity recognition. In this work, the effects of improving all of these pre-processing steps will be investigated. It is anticipated that improving the clustering results will also improve the mapping between the Malay and English clusters obtained from bilingual clustering. Hence, by increasing the mapping percentage for the bilingual clusters, a more robust clustering algorithm can be developed for clustering bilingual documents. In this study, a genetic algorithm (GA) is also proposed to determine the set of terms that can be used to cluster bilingual documents more effectively.
format Research Report
author Mohd Norhisham Razali
Rayner Alfred
Chin, Kim On
author_facet Mohd Norhisham Razali
Rayner Alfred
Chin, Kim On
author_sort Mohd Norhisham Razali
title An intelligent categorization tool for malay research articles
title_short An intelligent categorization tool for malay research articles
title_full An intelligent categorization tool for malay research articles
title_fullStr An intelligent categorization tool for malay research articles
title_full_unstemmed An intelligent categorization tool for malay research articles
title_sort intelligent categorization tool for malay research articles
publisher Universiti Malaysia Sabah
publishDate 2015
url https://eprints.ums.edu.my/id/eprint/24678/1/An%20intelligent%20categorization%20tool%20for%20malay%20research%20articles.pdf
https://eprints.ums.edu.my/id/eprint/24678/
_version_ 1760230268249047040