CLASSIFICATION AND CLUSTERING TO GET THE TEXT STRUCTURE OF THE CITIZEN REPORT

A citizen report is a text in which members of the public describe complaints about their area. Each report needs to be classified into a specific category and subcategory so that the government can handle it more easily. Currently, Jakarta Smart City still classifies report texts manually. This...

Bibliographic Details
Main Author:	Aruda Lisjana, Oktefvia
Format:	Theses
Language:	Indonesia
Online Access:	https://digilib.itb.ac.id/gdl/view/65934
Institution:	Institut Teknologi Bandung

id	id-itb.:65934
spelling	id-itb.:65934 2022-06-25T23:36:39Z CLASSIFICATION AND CLUSTERING TO GET THE TEXT STRUCTURE OF THE CITIZEN REPORT Aruda Lisjana, Oktefvia Indonesia Theses classification, clustering, complaint text, deep learning, Latent Dirichlet Allocation, Term Frequency-Inverse Cluster Frequency INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/65934 text
institution	Institut Teknologi Bandung
building	Institut Teknologi Bandung Library
continent	Asia
country	Indonesia
content_provider	Institut Teknologi Bandung
collection	Digital ITB
language	Indonesia
description	A citizen report is a text in which members of the public describe complaints about their area. Each report needs to be classified into a specific category and subcategory so that the government can handle it more easily. Currently, Jakarta Smart City still classifies report texts manually, which takes a long time. In addition, the text structure is needed so that complaint reports can be categorized properly. Automatic classification is therefore needed. In the available dataset, each text is labeled with a category but not with a subcategory, so this study classifies reports into categories and clusters them to obtain subcategories. The study uses Recurrent Neural Network (RNN) deep learning for classification and Latent Dirichlet Allocation (LDA) topic modeling for clustering. For classification, two types of RNN unit are compared: Bidirectional Long Short-Term Memory (Bi-LSTM) and Gated Recurrent Unit (GRU). The dataset is also imbalanced, so it is handled with the Synthetic Minority Over-Sampling Technique (SMOTE) and class weights. Two word embeddings are used: Word2Vec and FastText. Classification is evaluated with accuracy and macro F1-score. For clustering, topic coherence is used to determine the number of clusters in LDA, and each cluster then generates keywords. To obtain labels automatically, the LDA keywords are compared with the significant terms from Term Frequency-Inverse Cluster Frequency (TFICF) using cosine similarity. Besides determining the number of clusters, topic coherence is also used to evaluate the clustering results. The data are split into 80% training data and 20% test data, and validation uses 5-fold cross-validation.
From the classification experiments, the best model combines the FastText word embedding with the GRU unit, reaching an accuracy of 0.78 and a macro F1-score of 0.52. For the clustering evaluation, results were obtained for 20 categories; the results for each category can be seen in Appendix C.
format	Theses
author	Aruda Lisjana, Oktefvia
title	CLASSIFICATION AND CLUSTERING TO GET THE TEXT STRUCTURE OF THE CITIZEN REPORT
url	https://digilib.itb.ac.id/gdl/view/65934
_version_	1822932891112308736
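
The description above reports an accuracy of 0.78 alongside a much lower macro F1-score of 0.52, and names class weights as one way of handling the imbalanced dataset. The sketch below illustrates, on made-up toy labels, why the two metrics can diverge under class imbalance and how one common inverse-frequency weighting can be computed. The label names, the data, and the function names are assumptions for illustration only. The abstract does not state which weighting formula the thesis used, so the one shown here may differ from it.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def inverse_frequency_weights(y_true):
    """Assumed heuristic: n_samples / (n_classes * count of the class)."""
    counts = Counter(y_true)
    n, k = len(y_true), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


# Hypothetical toy labels: one dominant category and two rare ones.
y_true = ["flood"] * 8 + ["road"] + ["waste"]
# A degenerate model that always predicts the majority category.
y_pred = ["flood"] * 10

print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", round(macro_f1(y_true, y_pred), 4))
print("weights :", inverse_frequency_weights(y_true))
```

On this toy data the majority-class predictor still reaches high accuracy, while macro F1 is pulled down because the two rare categories each contribute an F1 of zero. The weights grow as a category becomes rarer, which is the intuition behind weighting the loss per class.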

Similar Items