Synthesis of Tax Return Datasets for Development of Tax Evasion Detection
Main Author: | |
---|---|
Other Authors: | |
Format: | Article |
Published: | 2023 |
Subjects: | |
Online Access: | https://repository.li.mahidol.ac.th/handle/123456789/82939 |
Institution: | Mahidol University |
Summary: | Datasets are an essential part of data science processes. However, obtaining a dataset, especially a tax return dataset, is challenging as privacy concerns become more prominent in daily life. We therefore adopt data synthesis: publicly available data are augmented using a Generative Adversarial Network (GAN) and the Synthetic Minority Oversampling TEchnique (SMOTE). The synthetic data are evaluated using a correlation matrix, Principal Component Analysis (PCA), and a quality score. In addition, fundamental machine learning models, chosen on the basis of a literature review, are used to detect tax evasion. The data are gathered from the financial statements of companies listed on the Stock Exchange of Thailand (SET). Our results indicate that synthetic datasets with an average quality score of 0.86 can train models that yield approximately 0.95 accuracy and 0.93 F1-score. Additionally, increasing the number of instances mitigates the effects of class imbalance and high variance. The expected benefits include the use of open data for analysis and the application of synthetic datasets. Future research could consider the statistical behavior of different business sectors, multiclass labeling for advanced recommendations, and the implementation of unsupervised models. |
---|
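The summary names SMOTE as one of the augmentation techniques but this record gives no implementation details. As an illustration only, the following is a minimal sketch of the core SMOTE interpolation step, assuming numeric features and Euclidean distance. The function name `smote`, its parameters, and the sample values are hypothetical and are not taken from the article.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours.
    Illustrative sketch; not the authors' implementation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        # The new point lies on the segment between base and neighbour.
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature minority class (values are made up).
minority_rows = [(0.10, 0.90), (0.20, 0.80), (0.15, 0.95), (0.30, 0.70)]
new_rows = smote(minority_rows, n_new=6)
```

Because each synthetic point is a convex combination of two existing minority points, it stays within the range already spanned by the minority class, which is why the technique adds instances without extrapolating beyond the observed data.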