HIERARCHICAL MULTILABEL CLASSIFICATION USING DISTRIBUTED SEMANTIC MODEL BASED FEATURES FOR INDONESIAN NEWS ARTICLE

Automatic news categorization is essential to handle multi-variant news articles. This research employs hierarchical multilabel classification to conduct news categorization. Based on our previous research, performance of hierarchical multilabel classification mode...

Bibliographic Details
Main Author: CLAIRINE IRSAN - NIM: 23516081 , IVANA
Format: Theses
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/28175
Institution: Institut Teknologi Bandung
id id-itb.:28175
spelling id-itb.:28175 2018-03-16T08:56:04Z
institution Institut Teknologi Bandung
building Institut Teknologi Bandung Library
continent Asia
country Indonesia
content_provider Institut Teknologi Bandung
collection Digital ITB
language Indonesia
description Automatic news categorization is essential for handling the wide variety of news articles. This research employs hierarchical multilabel classification to categorize news. Our previous research showed that the performance of the hierarchical multilabel classification model needed to be improved. Three approaches that could potentially improve its performance are examined. The first uses a deep learning classifier, in this case a CNN, to classify news at the parent level. The second represents documents by the average of their word vectors from a word embedding. The third combines each word's term frequency with the word vector average to build the features used to train the multilabel classifiers. In this experiment, the best performance was 75.31%, achieved by a Calibrated Label Ranking – Naïve Bayes model with documents represented by multiplying each word's term frequency with the word vector average. This configuration improved multilabel classification performance by 4.25% compared with the previous result. The distributed semantic model that contributed to the best performance was a 300-dimension word2vec model trained on Wikipedia articles. Moreover, the multilabel classification model is also influenced by the news release date: if the training data and the test data are collected from different time ranges, the model's performance decreases. This can be seen in the experiment's results, where the model's performance decreased when 5635 data items from the latest timestamps were added as training data.
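The two document representations named in the abstract can be illustrated with a minimal sketch. The record does not give the exact formula, so the term-frequency variant below is only one possible reading (each distinct word's vector is scaled by its raw count before averaging). The toy three-dimensional `EMBEDDINGS` table and the function names are hypothetical stand-ins for the 300-dimension word2vec model and are not taken from the thesis.

```python
from collections import Counter

import numpy as np

# Hypothetical toy embedding table (assumption); the thesis used a
# 300-dimension word2vec model trained on Wikipedia articles.
EMBEDDINGS = {
    "harga": np.array([1.0, 0.0, 0.0]),
    "saham": np.array([0.0, 1.0, 0.0]),
    "naik": np.array([0.0, 0.0, 1.0]),
}
DIM = 3


def average_vector(tokens):
    """Plain average of the vectors of all in-vocabulary tokens."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return np.zeros(DIM)
    return np.mean(vectors, axis=0)


def tf_weighted_vector(tokens):
    """Assumed reading of 'term frequency times word vector average':
    scale each distinct word's vector by its raw count, then average
    over the distinct in-vocabulary words."""
    counts = Counter(t for t in tokens if t in EMBEDDINGS)
    if not counts:
        return np.zeros(DIM)
    scaled = [count * EMBEDDINGS[word] for word, count in counts.items()]
    return np.mean(scaled, axis=0)


doc = ["saham", "naik", "naik", "kemarin"]  # "kemarin" is out of vocabulary
plain = average_vector(doc)
weighted = tf_weighted_vector(doc)
```

Under this reading, repeated words pull the weighted representation further toward their own vectors than the plain average does. Either vector could then serve as the feature input to a multilabel classifier.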
format Theses
author CLAIRINE IRSAN - NIM: 23516081 , IVANA
title HIERARCHICAL MULTILABEL CLASSIFICATION USING DISTRIBUTED SEMANTIC MODEL BASED FEATURES FOR INDONESIAN NEWS ARTICLE
url https://digilib.itb.ac.id/gdl/view/28175
_version_ 1822922495374655488