HIERARCHICAL MULTILABEL CLASSIFICATION USING DISTRIBUTED SEMANTIC MODEL BASED FEATURES FOR INDONESIAN NEWS ARTICLE
Automatic news categorization is essential to handle multi-variant news articles. This research employs hierarchical multilabel classification to conduct news categorization. Based on our previous research, performance of hierarchical multilabel classification mode...
Main Author: | CLAIRINE IRSAN - NIM: 23516081 , IVANA |
---|---|
Format: | Theses |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/28175 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:28175 |
---|---|
spelling |
id-itb.:28175 2018-03-16T08:56:04Z HIERARCHICAL MULTILABEL CLASSIFICATION USING DISTRIBUTED SEMANTIC MODEL BASED FEATURES FOR INDONESIAN NEWS ARTICLE CLAIRINE IRSAN - NIM: 23516081 , IVANA Indonesia Theses INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/28175 text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
Automatic news categorization is essential for handling the wide variety of news articles. This research applies hierarchical multilabel classification to news categorization. Our previous research showed that the performance of the hierarchical multilabel classification model needed improvement, and this work examines three methods that could improve it. The first uses a deep learning classifier, in this case a CNN, to classify news at the parent level. The second represents each document by the average of its word vectors from a word embedding. The third combines each word's term frequency with the word vector average to build the features used to train the multilabel classifiers. In the experiments, the best performance was 75.31%, achieved by a Calibrated Label Ranking – Naïve Bayes model with documents represented by multiplying each word's term frequency by the word vector average. This configuration improved multilabel classification performance by 4.25% compared to the previous result. The distributed semantic model behind the best performance was a 300-dimensional word2vec model trained on Wikipedia articles. The multilabel classification model is also affected by the news release date: when the training data and the test data are collected from different time ranges, the model's performance decreases. This is visible in the experimental results, where performance dropped when 5,635 instances with the latest timestamps were added to the training data.
|
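The two embedding-based document representations mentioned in the description can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not give the exact formula, so the sketch assumes that "multiplying term frequency with the word vector average" means scaling each distinct word's vector by its raw count before averaging. The function names `mean_vector` and `tf_weighted_vector` and the toy two-dimensional embeddings are hypothetical; the thesis reports using 300-dimensional word2vec vectors.

```python
from collections import Counter
from typing import Dict, List

Vector = List[float]


def mean_vector(tokens: List[str], embeddings: Dict[str, Vector], dim: int) -> Vector:
    """Average the vectors of the distinct in-vocabulary words of a document."""
    words = [w for w in Counter(tokens) if w in embeddings]
    if not words:
        return [0.0] * dim  # no known word: fall back to a zero vector
    return [sum(embeddings[w][i] for w in words) / len(words) for i in range(dim)]


def tf_weighted_vector(tokens: List[str], embeddings: Dict[str, Vector], dim: int) -> Vector:
    """Scale each distinct word's vector by its raw term frequency, then average.

    Assumed reading of the abstract, not a confirmed formula.
    """
    counts = Counter(t for t in tokens if t in embeddings)
    if not counts:
        return [0.0] * dim
    doc = [0.0] * dim
    for word, tf in counts.items():
        for i, value in enumerate(embeddings[word]):
            doc[i] += tf * value
    return [x / len(counts) for x in doc]


# Toy embeddings (hypothetical values, 2 dimensions for readability).
toy_embeddings = {"harga": [1.0, 0.0], "saham": [0.0, 1.0]}
toy_tokens = ["harga", "saham", "harga", "xyz"]  # "xyz" is out of vocabulary

print(mean_vector(toy_tokens, toy_embeddings, 2))         # [0.5, 0.5]
print(tf_weighted_vector(toy_tokens, toy_embeddings, 2))  # [1.0, 0.5]
```

In this sketch the repeated word "harga" pulls the weighted representation toward its own vector, which the plain average over distinct words does not capture. Either vector could then be passed as a feature row to a multilabel classifier.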
format |
Theses |
author |
CLAIRINE IRSAN - NIM: 23516081 , IVANA |
title |
HIERARCHICAL MULTILABEL CLASSIFICATION USING DISTRIBUTED SEMANTIC MODEL BASED FEATURES FOR INDONESIAN NEWS ARTICLE |
url |
https://digilib.itb.ac.id/gdl/view/28175 |
_version_ |
1822922495374655488 |