CLASSIFICATION AND CLUSTERING TO GET THE TEXT STRUCTURE OF THE CITIZEN REPORT

Bibliographic Details
Main Author: Aruda Lisjana, Oktefvia
Format: Theses
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/65934
Institution: Institut Teknologi Bandung
Description
Summary: A citizen report is a text in which members of the public describe complaints about their area. Each report needs to be classified into a specific category and subcategory so that the government can process it more easily. Currently, Jakarta Smart City still relies on people to classify report texts manually, which takes a long time. In addition, a text structure is required so that complaint reports can be categorized properly. Automatic classification is therefore needed. In the available dataset, each text already has a category label but no subcategory label, so this study performs classification for categories and clustering for subcategories. The study uses Recurrent Neural Network (RNN) deep learning for classification and Latent Dirichlet Allocation (LDA) topic modeling for clustering. For classification, two types of RNN unit are compared: Bidirectional Long Short-Term Memory (Bi-LSTM) and Gated Recurrent Unit (GRU). The dataset is also imbalanced, so it requires special handling with the Synthetic Minority Over-sampling Technique (SMOTE) and class weights. Two word embeddings are used: Word2Vec and FastText. Classification is evaluated with accuracy and macro F1-score. For clustering, topic coherence is used to determine the number of clusters in LDA, and each cluster then generates keywords. To obtain labels automatically, the LDA keywords are compared, using cosine similarity, with the significant terms from Term Frequency-Inverse Cluster Frequency (TFICF). Besides determining the number of clusters, topic coherence is also used to assess the clustering results. The data are split into 80% training data and 20% test data, and validation uses 5-fold cross-validation.
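The automatic labeling step described in the summary, comparing each cluster's LDA keywords with candidate significant terms by cosine similarity, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the English keywords, the candidate labels, the unit term weights, and the helper names `cosine_similarity` and `best_label` are hypothetical, and the thesis may represent and weight terms differently.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    shared = a.keys() & b.keys()
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_label(cluster_keywords, candidate_terms):
    """Pick the candidate label whose terms are most similar to the keywords."""
    cluster_vec = Counter(cluster_keywords)  # assumption: unit weight per term
    scores = {
        label: cosine_similarity(cluster_vec, Counter(terms))
        for label, terms in candidate_terms.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical keywords for one LDA cluster.
lda_keywords = ["road", "hole", "damaged", "asphalt"]

# Hypothetical significant terms per candidate label.
significant_terms = {
    "road damage": ["road", "damaged", "asphalt", "repair"],
    "flooding": ["flood", "water", "drain", "road"],
}

label, scores = best_label(lda_keywords, significant_terms)
print(label, scores)
```

In this toy case the cluster shares three of four terms with "road damage" and only one with "flooding", so the first label is chosen. With real data the term weights would come from the LDA topic distribution and the TFICF scores rather than unit counts.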
In the classification experiments, the best model combined the FastText word embedding with the GRU method, achieving an accuracy of 0.78 and a macro F1-score of 0.52. The clustering evaluation covers 20 categories; the evaluation results for each category can be seen in Appendix C.
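The gap between the reported accuracy and macro F1-score is typical of imbalanced data, because macro F1 averages the per-class F1 scores with equal weight regardless of class size. The sketch below illustrates this on a toy label set; the labels, predictions, and function names are hypothetical and are not taken from the thesis data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true positives scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical imbalanced test set: 8 reports of class "A", 2 of class "B".
y_true = ["A"] * 8 + ["B"] * 2
# A classifier that always predicts the majority class.
y_pred = ["A"] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, macro F1={f1:.2f}")
```

Here the always-majority classifier reaches an accuracy of 0.80 but a macro F1 of about 0.44, because the minority class contributes an F1 of zero to the average.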