CLASSIFICATION AND CLUSTERING TO GET THE TEXT STRUCTURE OF THE CITIZEN REPORT
Citizen reports are texts in which members of the public describe problems in their area. Each report needs to be classified into a category and sub-category so that the government can follow up on it more easily. Currently, Jakarta Smart City still relies on people to classify report texts manually. This...
Main Author: | Aruda Lisjana, Oktefvia |
---|---|
Format: | Theses |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/65934 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:65934 |
---|---|
spelling |
id-itb.:65934; 2022-06-25T23:36:39Z; Aruda Lisjana, Oktefvia; Indonesia; Theses; classification, clustering, complaint text, deep learning, Latent Dirichlet Allocation, Term Frequency-Inverse Cluster Frequency; INSTITUT TEKNOLOGI BANDUNG; https://digilib.itb.ac.id/gdl/view/65934; text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
Citizen reports are texts in which members of the public describe problems in their
area. Each report needs to be classified into a category and sub-category so that
the government can follow up on it more easily. Currently, Jakarta Smart City still
relies on people to classify report texts manually, which takes a long time. In
addition, the structure of the text is needed so that complaint reports can be
categorized properly. Automatic classification is therefore needed. In the
available dataset, each text already has a category label but no subcategory label,
so this study classifies the texts into categories and derives subcategories by
clustering.
This study uses Recurrent Neural Network (RNN) deep learning for classification
and Latent Dirichlet Allocation (LDA) topic modeling for clustering. For
classification, two types of RNN units are compared: Bidirectional Long
Short-Term Memory (Bi-LSTM) and Gated Recurrent Unit (GRU). The dataset is also
imbalanced, which is handled with the Synthetic Minority Over-sampling Technique
(SMOTE) and class weights. Two word embeddings are used, Word2Vec and FastText.
Classification is evaluated with accuracy and macro F1-score. For clustering,
topic coherence is used to determine the number of clusters in LDA, and each
cluster then produces a set of keywords. To obtain labels automatically, the LDA
keywords are compared with the significant terms from Term Frequency-Inverse
Cluster Frequency (TFICF) using cosine similarity. In addition to determining the
number of clusters, topic coherence is also used to evaluate the clustering
results.
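
The automatic labeling step described above can be illustrated with a minimal Python sketch. Several things here are assumptions, since the abstract does not give them: the TFICF weighting is taken to be a term's share of a group's tokens multiplied by log(number of groups / number of groups containing the term), the TFICF terms are assumed to come from a few already-named groups, and the group names, tokens, and keyword weights are invented toy data.

```python
import math
from collections import Counter


def tficf(cluster_docs):
    """Assumed weighting: (term share within a group) * log(groups / groups containing the term)."""
    n_clusters = len(cluster_docs)
    counts = {
        name: Counter(tok for doc in docs for tok in doc)
        for name, docs in cluster_docs.items()
    }
    # Count, for every term, how many groups it appears in.
    cluster_freq = Counter()
    for term_counts in counts.values():
        cluster_freq.update(term_counts.keys())
    weights = {}
    for name, term_counts in counts.items():
        total = sum(term_counts.values())
        weights[name] = {
            term: (n / total) * math.log(n_clusters / cluster_freq[term])
            for term, n in term_counts.items()
        }
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_label(topic_keywords, tficf_weights):
    """Pick the group whose TFICF vector is most similar to the topic keywords."""
    return max(tficf_weights, key=lambda name: cosine(topic_keywords, tficf_weights[name]))


# Hypothetical toy data: two named groups of tokenized reports.
groups = {
    "flood": [["banjir", "jalan", "air"], ["banjir", "saluran", "air"]],
    "waste": [["sampah", "jalan", "tumpukan"], ["sampah", "bau"]],
}
weights = tficf(groups)

# Hypothetical keyword weights for one LDA topic.
topic = {"banjir": 0.4, "air": 0.3, "jalan": 0.1}
label = best_label(topic, weights)
```

In this toy setup a term that occurs in every group, such as "jalan", receives a weight of zero, so only group-specific terms influence which label a topic receives.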
The data are split into 80% training data and 20% test data, and validation uses
5-fold cross-validation. In the classification experiments, the best model
combined FastText word embeddings with GRU and achieved an accuracy of 0.78 and a
macro F1-score of 0.52. The clustering evaluation covers 20 categories; the
results for each category can be seen in Appendix C. |
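
The gap between the reported accuracy and macro F1-score is typical of imbalanced data, because macro F1 averages the per-class F1 scores without weighting them by class size. The following is a minimal Python sketch of both metrics; the class names and predictions are invented toy data and are not taken from the thesis.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in truth or prediction."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no hits scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced toy set: 8 "road" reports and 2 "flood" reports,
# with a model that always predicts the majority class.
y_true = ["road"] * 8 + ["flood"] * 2
y_pred = ["road"] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Here the always-majority predictor gets most reports right, yet its macro F1 is much lower because the minority class contributes an F1 of zero to the average.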
format |
Theses |
author |
Aruda Lisjana, Oktefvia |
title |
CLASSIFICATION AND CLUSTERING TO GET THE TEXT STRUCTURE OF THE CITIZEN REPORT |
url |
https://digilib.itb.ac.id/gdl/view/65934 |
_version_ |
1822932891112308736 |