An intelligent categorization tool for malay research articles
Unlabeled research articles published in the Malay language are becoming increasingly common and available in Malaysia. Thus, the task of manually indexing these research articles is difficult and time-consuming. In order to facilitate research activities that depend on research resources written in l...
Main Authors: | Mohd Norhisham Razali; Rayner Alfred; Chin, Kim On |
---|---|
Format: | Research Report |
Language: | English |
Published: | Universiti Malaysia Sabah, 2015 |
Subjects: | P Philology. Linguistics |
Online Access: | https://eprints.ums.edu.my/id/eprint/24678/1/An%20intelligent%20categorization%20tool%20for%20malay%20research%20articles.pdf https://eprints.ums.edu.my/id/eprint/24678/ |
Institution: | Universiti Malaysia Sabah |
id |
my.ums.eprints.24678 |
---|---|
record_format |
eprints |
spelling |
my.ums.eprints.24678 2020-01-29T02:41:12Z https://eprints.ums.edu.my/id/eprint/24678/ An intelligent categorization tool for malay research articles Mohd Norhisham Razali Rayner Alfred Chin, Kim On P Philology. Linguistics Universiti Malaysia Sabah 2015 Research Report NonPeerReviewed text en https://eprints.ums.edu.my/id/eprint/24678/1/An%20intelligent%20categorization%20tool%20for%20malay%20research%20articles.pdf Mohd Norhisham Razali and Rayner Alfred and Chin, Kim On (2015) An intelligent categorization tool for malay research articles. (Unpublished) |
institution |
Universiti Malaysia Sabah |
building |
UMS Library |
collection |
Institutional Repository |
continent |
Asia |
country |
Malaysia |
content_provider |
Universiti Malaysia Sabah |
content_source |
UMS Institutional Repository |
url_provider |
http://eprints.ums.edu.my/ |
language |
English |
topic |
P Philology. Linguistics |
spellingShingle |
P Philology. Linguistics Mohd Norhisham Razali Rayner Alfred Chin, Kim On An intelligent categorization tool for malay research articles |
description |
Unlabeled research articles published in the Malay language are becoming increasingly common and available in Malaysia, which makes manually indexing them difficult and time-consuming. To facilitate research activities that depend on resources written in Malay, these articles must be categorized or indexed efficiently so that appropriate and relevant domains of knowledge can be recommended to researchers in Malaysia. Little research has been conducted on efficiently categorizing Malay research articles. The task is more complex than categorizing English research articles because of the complexity of the Malay language, and it therefore represents a major contemporary challenge. Malay text documents are often represented as high-dimensional, sparse vectors that use Malay words as features; a few thousand dimensions and a sparsity of 95 to 99% are typical. Determining the appropriate number of categories for a large number of Malay documents is also difficult and time-consuming because of this sparsity. If too many categories are assigned, related documents may be split across different clusters; if too few are assigned, unrelated documents may be grouped into the same cluster. This research addresses the improvement of several pre-processing steps that affect clustering performance: stemming, part-of-speech tagging and named-entity recognition. In this work, the effects of improving each of these steps will be investigated. It is anticipated that improving the clustering results will also improve the mapping between the Malay and English clusters obtained from bilingual clustering. Hence, by increasing the mapping percentage for the bilingual clusters, a more robust algorithm can be developed for clustering bilingual documents. In this study, a genetic algorithm (GA) is also proposed to determine the set of terms that can be used to cluster bilingual documents more effectively. |
format |
Research Report |
author |
Mohd Norhisham Razali Rayner Alfred Chin, Kim On |
author_facet |
Mohd Norhisham Razali Rayner Alfred Chin, Kim On |
author_sort |
Mohd Norhisham Razali |
title |
An intelligent categorization tool for malay research articles |
title_sort |
intelligent categorization tool for malay research articles |
publisher |
Universiti Malaysia Sabah |
publishDate |
2015 |
url |
https://eprints.ums.edu.my/id/eprint/24678/1/An%20intelligent%20categorization%20tool%20for%20malay%20research%20articles.pdf https://eprints.ums.edu.my/id/eprint/24678/ |
_version_ |
1760230268249047040 |
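
The description notes that Malay text documents are typically represented as high-dimensional, sparse vectors with Malay words as features. As a rough illustration only, the sketch below builds a bag-of-words document-term matrix and measures its sparsity (the fraction of zero entries). The function names `build_term_matrix` and `sparsity`, the whitespace tokenizer, and the three-document toy corpus are all assumptions made for this example; they are not taken from the report, which does not describe its implementation here.

```python
from collections import Counter


def build_term_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i.

    Tokenization is a plain lowercase whitespace split (an assumption;
    the report's own pre-processing is not specified in this record).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts.get(term, 0) for term in vocabulary])
    return vocabulary, rows


def sparsity(rows):
    """Fraction of zero entries in the document-term matrix."""
    total = sum(len(row) for row in rows)
    zeros = sum(1 for row in rows for value in row if value == 0)
    return zeros / total if total else 0.0


# Hypothetical toy corpus; real collections have thousands of terms.
docs = [
    "kajian bahasa melayu",
    "pengelasan dokumen teks",
    "algoritma genetik pengelompokan",
]
vocabulary, rows = build_term_matrix(docs)
print(len(vocabulary), round(sparsity(rows), 3))
```

Even with three short documents that share no words, two thirds of the matrix entries are zero. With a vocabulary of a few thousand terms and documents that each use only a small part of it, sparsity in the 95 to 99% range quoted in the description follows naturally.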