INDONESIAN HOAX NEWS CLASSIFICATION USING FEATURE SELECTION
Main Author: | |
---|---|
Format: | Theses |
Language: | Indonesian |
Online Access: | https://digilib.itb.ac.id/gdl/view/51859 |
Institution: | Institut Teknologi Bandung |
Summary: | Hoax information needs to be classified because hoaxes contain misleading and
dangerous information. Previous work classified hoaxes in email and SMS, but hoax
news articles have not yet been classified. Feature selection is required to
improve the accuracy of hoax article classification.
In this study, a collection of Indonesian hoax news is preprocessed, and feature
selection experiments are then performed using union and intersection. The feature
selection methods used are information gain, mutual information, chi-square, term
frequency, and TFxIDF, classified using Naive Bayes, SVM, and C4.5 with
unigrams, bigrams, and a mixture of both as the feature model. The document
collection consists of 220 articles (89 hoax and 131 non-hoax) from 22 topics,
with 10 articles (hoax and non-hoax) per topic. In total, 270 tests were run for
feature selection without combination and 360 tests for feature selection with
union and intersection combinations, with parameters including 3 feature models,
2 stemming settings, 2 stopword elimination settings, 5 feature selection
methods, 3 classifiers, and 3 variants of the number of features.
The best result, an accuracy of 91.36%, was obtained by combining mutual
information and information gain with the union operation, whereas information
gain alone yielded 90.45%. The intersection operation produced an accuracy below
both, at 90%. Testing used 10-fold cross-validation. In the error analysis, the
best F1 score reached 1 and the lowest was 0.815. These experiments also showed
that probability-based feature selection performs better than frequency-based
feature selection. |
---|
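
The union and intersection combinations described in the summary can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy corpus, the function names (`information_gain`, `term_frequency`, `top_k`) and the cutoff `k=3` are all assumptions made for illustration. For brevity it pairs information gain (probability-based) with raw term frequency (frequency-based), rather than the mutual information and information gain pair that the summary reports as the best combination.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not from the thesis): (text, label) pairs.
DOCS = [
    ("viral news miracle cure", "hoax"),
    ("viral news secret cure", "hoax"),
    ("official news vaccine report", "non-hoax"),
    ("official news budget report", "non-hoax"),
]

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(term, docs):
    """Class entropy minus class entropy conditioned on term presence."""
    labels = Counter(label for _, label in docs)
    with_term = Counter(label for text, label in docs if term in text.split())
    without_term = labels - with_term
    conditional = 0.0
    for part in (with_term, without_term):
        size = sum(part.values())
        if size:
            conditional += size / len(docs) * entropy(part.values())
    return entropy(labels.values()) - conditional

def term_frequency(term, docs):
    """Total number of occurrences of the term across the collection."""
    return sum(text.split().count(term) for text, _ in docs)

def top_k(score_fn, docs, k):
    """Return the k highest-scoring unigrams; ties are broken alphabetically."""
    vocab = {term for text, _ in docs for term in text.split()}
    ranked = sorted(vocab, key=lambda t: (-score_fn(t, docs), t))
    return set(ranked[:k])

ig_features = top_k(information_gain, DOCS, k=3)
tf_features = top_k(term_frequency, DOCS, k=3)

union_features = ig_features | tf_features         # selected by either method
intersection_features = ig_features & tf_features  # selected by both methods

print(sorted(union_features))
print(sorted(intersection_features))
```

In this toy data the word "news" occurs in every document, so term frequency ranks it first while information gain scores it zero. The union keeps it and the intersection drops it, which shows how the two operations yield feature sets of different sizes from the same pair of rankings.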