TEXT CLASSIFICATION FOR IDENTIFYING COST COMPONENTS IN FINANCIAL TRANSACTIONS OF ELECTRICITY SUPPLY USING MACHINE LEARNING
Main Author: | Nirmalasari, Listyani |
---|---|
Format: | Theses |
Language: | Indonesia |
Online Access: | https://digilib.itb.ac.id/gdl/view/86928 |
Institution: | Institut Teknologi Bandung |
id |
id-itb.:86928 |
---|---|
spelling |
id-itb.:86928 2025-01-07T08:38:02Z TEXT CLASSIFICATION FOR IDENTIFYING COST COMPONENTS IN FINANCIAL TRANSACTIONS OF ELECTRICITY SUPPLY USING MACHINE LEARNING Nirmalasari, Listyani Indonesia Theses CNN, Random Forest, SMOTE, Text classification INSTITUT TEKNOLOGI BANDUNG https://digilib.itb.ac.id/gdl/view/86928 text |
institution |
Institut Teknologi Bandung |
building |
Institut Teknologi Bandung Library |
continent |
Asia |
country |
Indonesia |
content_provider |
Institut Teknologi Bandung |
collection |
Digital ITB |
language |
Indonesia |
description |
To ensure accountability in the distribution of electricity subsidies, PT PLN
(Persero) is required to report periodically to the government the cost components
incurred in supplying electricity, referred to as the Basic Cost of Electricity Supply
(Biaya Pokok Penyediaan, or BPP) components. In this process,
PT PLN (Persero) must identify BPP components (also known as Allowable Costs)
and non-BPP components (also known as Non-Allowable Costs) from financial
transactions stored in the company’s financial system. Currently, the classification
of BPP and non-BPP components is conducted manually, which requires
significant resources. The financial transaction data used for this identification
consists of account codes and transaction description texts.
To make the identification of BPP and non-BPP components in large financial
transaction datasets more efficient, the author proposes a machine learning-based
text classification model. The data comprises financial transactions
from January to December 2023 for model development and evaluation, and
transactions from January to March 2024 for predictions using the developed
model. The data is categorized into three classes: AC, NAC, and PROP (with the
PROP class representing transactions with a proportional value relative to NAC).
The financial transaction data contains unstructured free text, characterized by
diverse formats, a mix of formal and informal language, and the use of
abbreviations, so it must be preprocessed before analysis. The preprocessing
steps are case folding, noise removal, tokenization, stop word
removal, spell checking, and word representation.
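The preprocessing steps listed above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the stop word set, the abbreviation dictionary (a simple lookup standing in for spell checking), the count-based word representation, and the function names `preprocess` and `bag_of_words` are all assumptions made for illustration.

```python
import re

# Hypothetical resources: the abstract does not list the stop words or the
# spelling/abbreviation dictionary that were actually used.
STOP_WORDS = {"dan", "di", "ke", "untuk", "yang"}
ABBREVIATIONS = {"byr": "bayar", "tgh": "tagihan", "pemb": "pembayaran"}


def preprocess(description: str) -> list[str]:
    """Turn a raw transaction description into a list of clean tokens."""
    text = description.lower()                            # case folding
    text = re.sub(r"[^a-z\s]", " ", text)                 # noise removal (digits, punctuation)
    tokens = text.split()                                 # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    return [ABBREVIATIONS.get(t, t) for t in tokens]      # dictionary-based spelling fix


def bag_of_words(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """Count-based word representation over a fixed vocabulary."""
    return [tokens.count(word) for word in vocabulary]


tokens = preprocess("Byr TGH listrik di Kantor #2023!!")
vector = bag_of_words(tokens, ["listrik", "bayar", "gaji"])
```

The resulting count vector is only one possible word representation; the abstract does not say which representation was used.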
Furthermore, the transaction data used in this study is imbalanced: its
classes are unevenly distributed. This requires
additional methods to address potential bias toward the majority class in machine
learning results. To address this issue, the Synthetic Minority Oversampling
Technique (SMOTE) is applied to improve the accuracy of machine learning
predictions.
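The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours, can be sketched as follows. This is a simplified variant written for illustration (it picks base samples at random rather than iterating over every minority sample); the function name `smote`, its parameters, and the toy data are assumptions and do not come from the thesis.

```python
import math
import random


def smote(minority, n_synthetic, k=2, seed=0):
    """Generate synthetic minority samples by interpolating toward neighbours.

    minority: list of equal-length numeric feature vectors from one class
              (at least two samples are required).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the minority class, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with three 2-D feature vectors (illustrative values only)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority, n_synthetic=5)
```

Each synthetic point lies on the line segment between two real minority samples, so oversampling adds variety without simply duplicating rows. Oversampling is normally applied to the training split only, so that evaluation data stays untouched.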
The study compares model performance in two scenarios: the first
uses the Random Forest method, while the second uses a convolutional neural
network (CNN). The findings show that the SMOTE-Random Forest model outperforms
the SMOTE-CNN model, achieving an accuracy of 97% and an AUC of 0.9871. When applied
to new data, the model achieves an accuracy of 85%. Using
machine learning to classify financial transactions into BPP and non-BPP
components for electricity subsidies significantly improves time efficiency, because the
model can process data faster than manual classification.
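The two reported metrics, accuracy and AUC, can be computed as in the sketch below. The abstract does not say how AUC was aggregated over the three classes, so the macro-averaged one-vs-rest scheme shown here is an assumption, as are the function names and the toy labels and scores.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(y_true, proba, classes):
    """Average the one-vs-rest AUC over classes; proba[i][j] scores classes[j]."""
    aucs = []
    for j, cls in enumerate(classes):
        labels = [1 if t == cls else 0 for t in y_true]
        aucs.append(binary_auc(labels, [row[j] for row in proba]))
    return sum(aucs) / len(aucs)


# Toy example with the three classes named in the abstract (values are made up)
classes = ["AC", "NAC", "PROP"]
y_true = ["AC", "AC", "NAC", "PROP"]
y_pred = ["AC", "NAC", "NAC", "PROP"]
proba = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]

acc = accuracy(y_true, y_pred)
auc = macro_ovr_auc(y_true, proba, classes)
```

On imbalanced data, accuracy alone can look high when the majority class dominates, which is one reason to report a ranking metric such as AUC alongside it.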
|
format |
Theses |
author |
Nirmalasari, Listyani |
title |
TEXT CLASSIFICATION FOR IDENTIFYING COST COMPONENTS IN FINANCIAL TRANSACTIONS OF ELECTRICITY SUPPLY USING MACHINE LEARNING |
url |
https://digilib.itb.ac.id/gdl/view/86928 |
_version_ |
1822999733075968000 |