FRAUD DETECTION IN FINANCIAL TRANSACTIONS USING LEAST-SQUARES EXPECTATION AND QUANTILE-BASED PROBABILISTIC SUPPORT VECTOR MACHINE WITH SYMBOLIC DATA

Bibliographic Details
Main Author: Hidajat, Christovito
Format: Theses
Language: Indonesia
Online Access: https://digilib.itb.ac.id/gdl/view/76541
Institution: Institut Teknologi Bandung
Description
Summary: Financial systems remain essential to meeting human needs, yet both digital and non-digital financial systems have security gaps. One of the biggest threats is fraud in digital transactions, which can cause huge losses. Industry currently implements fraud detection systems using machine learning. One of the most popular machine learning algorithms for classification is the support vector machine (SVM), because it works on both structured and unstructured data, is effective on high-dimensional data, and is flexible in its choice of kernel. However, SVM is not optimal for large datasets such as fraud data, because solving the quadratic programming (QP) problem requires large amounts of memory and computation time. Prediction time, in turn, is affected by model complexity, which depends on hyperparameters such as the kernel type and on the size of the data itself. This matters because fraud detection systems rely heavily on computational speed to flag fraudulent transactions in near real time and to maximize user experience.

To address computation time, the least-squares SVM (LS-SVM) method solves only a system of linear equations, which reduces computational complexity, especially for large data. In addition, one alternative data reduction technique is to aggregate classical data into symbolic data, such as numerical histograms and categories. Whereas classical data stores a single value, symbolic data carries additional information: histogram data, for example, stores ranges of values as bins together with their frequencies of occurrence. The aim is to retain as much information as possible in the smallest possible volume, which reduces memory use and computation time. Al-Ma'shumah et al. (2022) modified the probabilistic support vector machine (PSVM) model of Abaszade et al. (2018) into the expectation-based probabilistic SVM (EPSVM) and the quantile-based probabilistic SVM (QPSVM). These models can be applied to histogram data to produce expectation and quantile representations.

In this research, a machine learning-based fraud detection system is built using the least-squares-based SVM, EPSVM, and QPSVM algorithms. Development follows the CRISP-DM stages, beginning with exploratory data analysis (EDA) and data preparation, which includes random undersampling, feature selection, and transformation of classical data into histogram data. Modeling and experiments were then carried out over hyperparameters such as Nmember, binning method, kernel, and p-quantile value, and evaluated using recall, FPR, and AUC, as well as computation time, consisting of training time and prediction time. For the Standard LS-SVM, the RBF kernel achieved the best evaluation metrics, with a recall of 0.900, an FPR of 0.012, an AUC of 0.944, a training time of 5144 seconds, and a prediction time of 7.5 seconds. The best LS-PSVM model, the QPSVM algorithm with p = 0.5, a Polynomial kernel, Nmember = 5, and the Doane binning method, achieved a recall of 0.860, an FPR of 0.039, an AUC of 0.910, a training time of 405 seconds, and a prediction time of 2 seconds. The evaluation metrics of the best LS-PSVM are still lower than those of the Standard LS-SVM model, because information is lost when classical data is transformed into its symbolic representation. However, the lower training and prediction times reduce computational operating costs, allow more frequent model updates and experiments, and speed up real-time detection of and response to fraudulent transactions, thereby improving the user experience of the financial system.
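The two ideas the summary relies on can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes NumPy, the standard LS-SVM classifier formulation (a single linear system in the bias b and the multipliers alpha), and a simple reading of the symbolic step in which a group of classical values is binned with the Doane rule and then reduced to its expectation or p-quantile. All function names, the regularization parameter `gamma`, and the RBF width `sigma` are hypothetical choices made for illustration.

```python
import numpy as np

def histogram_summary(values, p=0.5):
    """Bin one group of classical values (Doane rule) and return the
    histogram's expectation and p-quantile. Assumed illustration only."""
    counts, edges = np.histogram(values, bins="doane")
    freq = counts / counts.sum()              # relative frequency per bin
    mids = (edges[:-1] + edges[1:]) / 2       # bin midpoints
    expectation = float(np.sum(freq * mids))
    # Quantile from the piecewise-linear CDF implied by the histogram
    # (values are treated as uniformly spread inside each bin).
    cdf = np.concatenate(([0.0], np.cumsum(freq)))
    quantile = float(np.interp(p, cdf, edges))
    return expectation, quantile

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    """Train an LS-SVM classifier by solving one linear system
    instead of a quadratic program. Labels y must be -1 or +1."""
    n = len(y)
    omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    solution = np.linalg.solve(A, rhs)
    return solution[0], solution[1:]          # bias b, multipliers alpha

def lssvm_predict(X_train, y_train, b, alpha, X_new, sigma=1.0):
    """Classify new rows with the sign of the LS-SVM decision function."""
    scores = rbf_kernel(X_new, X_train, sigma) @ (alpha * y_train) + b
    return np.where(scores >= 0, 1, -1)

# Toy usage: five values per group, mirroring Nmember = 5 in the summary.
expectation, median = histogram_summary(np.array([0.0, 1.0, 2.0, 3.0, 4.0]))

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1], dtype=float)
b, alpha = lssvm_fit(X, y)
labels = lssvm_predict(X, y, b, alpha, np.array([[0.5, 0.5], [5.5, 5.5]]))
```

In a pipeline of the kind the summary describes, each histogram-valued feature would be replaced by its expectation or p-quantile before training, so the linear system has one row per aggregated group rather than one per transaction. That size reduction is the likely source of the shorter training and prediction times reported above, though the sketch does not reproduce those figures.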