A category classification algorithm for Indonesian and Malay news documents
Main Authors: | , , |
---|---|
Format: | Article |
Published: | Penerbit UTM Press, 2016 |
Online Access: | https://www.scopus.com/inward/record.uri?eid=2-s2.0-84988430997&doi=10.11113%2fjt.v78.9549&partnerID=40&md5=ecdbab4a964888b760afd4013033549a http://eprints.utp.edu.my/25485/ |
Institution: | Universiti Teknologi Petronas |
Summary: | Text classification (TC) provides a better way to organize information because it supports understanding and interpretation of content. It assigns labels to groups of similar text documents. However, TC research on Asian-language documents is limited compared with research on English documents, and is scarcer still for news articles. In addition, TC research on languages with similar morphology, such as Indonesian and Malay, remains scarce. Hence, this study aims to develop an integrated generic TC algorithm that identifies the language of a news document and then classifies its category. Furthermore, a top-n feature selection method is used to improve TC performance and to address two challenges of classifying online news corpora: the rapid growth of online news documents and high computational time. Experiments were conducted using 280 Indonesian and 280 Malay online news documents from 2014-2015. The method achieved an accuracy rate of up to 95.63 for language identification and 97.5 for category classification. The category classifier performs best at n = 60, with an average computation time of 35 seconds. These results indicate that the integrated generic TC algorithm has an advantage over manual classification and is suitable for Indonesian and Malay news classification. © 2016 Penerbit UTM Press. All rights reserved. |
---|
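
The two-stage flow described in the summary (identify the language, then classify the category using only the top-n features) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the record does not say how features are ranked or which classifier is used, so the sketch ranks terms by raw frequency and scores a document by counting its tokens that fall in each label's top-n term set. All function names and the four toy training sentences are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphabetic runs only (assumed preprocessing).
    return re.findall(r"[a-z]+", text.lower())

def top_n_profile(texts, n):
    # Assumption: "top-n" means the n most frequent terms for a label.
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return {term for term, _ in counts.most_common(n)}

def train(labelled_texts, n):
    # labelled_texts: iterable of (label, text). Returns {label: term set}.
    grouped = {}
    for label, text in labelled_texts:
        grouped.setdefault(label, []).append(text)
    return {label: top_n_profile(texts, n) for label, texts in grouped.items()}

def classify(text, profiles):
    # Score each label by how many tokens of the text are in its profile.
    tokens = tokenize(text)
    scores = {label: sum(tok in terms for tok in tokens)
              for label, terms in profiles.items()}
    return max(scores, key=scores.get)

def classify_news(text, lang_profiles, cat_profiles_by_lang):
    # Stage 1: language identification. Stage 2: category classification
    # with the category profiles trained for that language.
    lang = classify(text, lang_profiles)
    return lang, classify(text, cat_profiles_by_lang[lang])

# Hypothetical toy corpus: (language, category, text).
CORPUS = [
    ("id", "economy", "pemerintah menaikkan harga bensin karena anggaran"),
    ("id", "sports", "tim sepak bola menang karena pelatih baru"),
    ("ms", "economy", "kerajaan menaikkan harga minyak kerana bajet"),
    ("ms", "sports", "pasukan bola sepak menang kerana jurulatih baharu"),
]

N = 60  # the summary reports n = 60 as the best setting for categories
lang_profiles = train([(lang, text) for lang, _, text in CORPUS], N)
cat_profiles_by_lang = {
    lang: train([(cat, text) for l, cat, text in CORPUS if l == lang], N)
    for lang in {"id", "ms"}
}

result_id = classify_news("harga bensin naik karena anggaran pemerintah",
                          lang_profiles, cat_profiles_by_lang)
result_ms = classify_news("jurulatih pasukan bola sepak baharu",
                          lang_profiles, cat_profiles_by_lang)
```

Restricting each profile to n terms bounds the work per document, which is one plausible reading of how top-n selection reduces computation time as a corpus grows. A real system would likely replace the overlap count with a trained classifier and a more discriminative ranking of terms.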