Synthesis of Tax Return Datasets for Development of Tax Evasion Detection
Datasets are an essential part of data science processes. However, obtaining a dataset, especially a tax return dataset, is challenging as privacy concerns become more prominent in daily life. Thus, data synthesis is the approach selected for our work, utilizing publicly available data and augmenting i...
Main Author: | Visitpanya N. |
---|---|
Other Authors: | Mahidol University |
Format: | Article |
Published: | 2023 |
Subjects: | Computer Science |
Online Access: | https://repository.li.mahidol.ac.th/handle/123456789/82939 |
Institution: | Mahidol University |
id |
th-mahidol.82939 |
---|---|
record_format |
dspace |
spelling |
th-mahidol.82939; 2023-06-04T00:07:31Z; Synthesis of Tax Return Datasets for Development of Tax Evasion Detection; Visitpanya N.; Mahidol University; Computer Science; 2023-06-03T17:07:31Z; 2023-06-03T17:07:31Z; 2023-01-01; Article; IEEE Access (2023); DOI 10.1109/ACCESS.2023.3276761; ISSN 21693536; Scopus ID 2-s2.0-85160271058; https://repository.li.mahidol.ac.th/handle/123456789/82939; SCOPUS |
institution |
Mahidol University |
building |
Mahidol University Library |
continent |
Asia |
country |
Thailand |
content_provider |
Mahidol University Library |
collection |
Mahidol University Institutional Repository |
topic |
Computer Science |
description |
Datasets are an essential part of data science processes. However, obtaining a dataset, especially a tax return dataset, is challenging as privacy concerns become more prominent in daily life. We therefore synthesize data by taking publicly available data and augmenting it with a Generative Adversarial Network (GAN) and the Synthetic Minority Oversampling Technique (SMOTE). The synthetic data are evaluated using a correlation matrix, Principal Component Analysis (PCA), and a quality score. In addition, fundamental machine learning models, selected on the basis of a literature review, are used to detect tax evasion. The data are gathered from the financial statements of companies listed on the Stock Exchange of Thailand (SET). Our results indicate that synthetic datasets with an average quality score of 0.86 can train models that yield approximately 0.95 accuracy and 0.93 F1-score. Additionally, increasing the number of instances mitigates the effects of class imbalance and high variance. The expected benefits include the use of open data for analysis and the application of synthetic datasets. Future research could consider the statistical behavior of different business sectors, multiclass labeling for more advanced recommendations, and the implementation of unsupervised models. |
author2 |
Mahidol University |
format |
Article |
author |
Visitpanya N. |
title |
Synthesis of Tax Return Datasets for Development of Tax Evasion Detection |
publishDate |
2023 |
url |
https://repository.li.mahidol.ac.th/handle/123456789/82939 |
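
The description in this record names SMOTE as one of the two augmentation techniques but does not show how it works. The sketch below illustrates the general idea only: each synthetic row is placed at a random point on the line between a minority-class row and one of its nearest minority-class neighbours. It is not the authors' implementation. The function name `smote_like`, its parameters, and the toy rows are all hypothetical, and a real project would more likely use an established library.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [row for j, row in enumerate(minority) if j != i]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        # Place the new row at a random point on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: two made-up financial ratios per company.
minority_rows = [[0.10, 1.5], [0.12, 1.7], [0.09, 1.4], [0.15, 2.0]]
new_rows = smote_like(minority_rows, n_new=6)
```

Because every synthetic row is a convex combination of two existing rows, it stays inside the range of the original minority data, which is why the technique adds instances without inventing values outside what was observed.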