Synthesis of Tax Return Datasets for Development of Tax Evasion Detection

Datasets are an essential part of data science processes. However, retrieving a dataset, especially a tax return dataset, is challenging as privacy becomes more evident in our daily lives. Thus, data synthesis is an approach selected for our work by utilizing publicly available data and augmenting i...

Bibliographic Details
Main Author: Visitpanya N.
Other Authors: Mahidol University
Format: Article
Published: 2023
Subjects: Computer Science
Online Access: https://repository.li.mahidol.ac.th/handle/123456789/82939
Institution: Mahidol University
id th-mahidol.82939
record_format dspace
source IEEE Access (2023)
doi 10.1109/ACCESS.2023.3276761
issn 2169-3536
scopus_id 2-s2.0-85160271058
indexed_in SCOPUS
record_timestamps 2023-06-04T00:07:31Z 2023-06-03T17:07:31Z 2023-01-01
institution Mahidol University
building Mahidol University Library
continent Asia
country Thailand
content_provider Mahidol University Library
collection Mahidol University Institutional Repository
topic Computer Science
description Datasets are an essential part of data science processes. However, obtaining a dataset, especially a tax return dataset, is challenging as privacy concerns become more prominent in daily life. We therefore adopt data synthesis: we take publicly available data and augment it using a Generative Adversarial Network (GAN) and the Synthetic Minority Oversampling TEchnique (SMOTE). The synthetic data are evaluated using a correlation matrix, Principal Component Analysis (PCA), and a quality score. In addition, fundamental machine learning models, selected on the basis of a literature review, are used to detect tax evasion. The data are gathered from the financial statements of companies registered with the Stock Exchange of Thailand (SET). Our results indicate that synthetic datasets with an average quality score of 0.86 can train models that achieve approximately 0.95 accuracy and a 0.93 F1-score. Additionally, increasing the number of instances mitigates the effects of class imbalance and high variance. The expected benefits include the use of open data for analysis and the application of synthetic datasets. Future research could consider the statistical behavior of different business sectors, multiclass labeling for advanced recommendations, and the implementation of unsupervised models.
format Article
author Visitpanya N.
title Synthesis of Tax Return Datasets for Development of Tax Evasion Detection
publishDate 2023
url https://repository.li.mahidol.ac.th/handle/123456789/82939