Synthesis of Tax Return Datasets for Development of Tax Evasion Detection

Bibliographic Details
Main Author: Visitpanya N.
Other Authors: Mahidol University
Format: Article
Published: 2023
Subjects:
Online Access: https://repository.li.mahidol.ac.th/handle/123456789/82939
Institution: Mahidol University
Description
Summary: Datasets are essential to data science processes. However, obtaining a dataset, especially a tax return dataset, is challenging as privacy concerns become more prominent in daily life. We therefore synthesize data by taking publicly available data and augmenting it with a Generative Adversarial Network (GAN) and the Synthetic Minority Oversampling TEchnique (SMOTE). The synthetic data are evaluated using a correlation matrix, Principal Component Analysis (PCA), and a quality score. In addition, following a literature review, fundamental machine learning models are used to detect tax evasion. The data are gathered from the financial statements of companies listed on the Stock Exchange of Thailand (SET). Our results indicate that synthetic datasets with an average quality score of 0.86 can train models that achieve approximately 0.95 accuracy and 0.93 F1-score. Additionally, adding more instances mitigates the effects of class imbalance and high variance. The expected benefits include the use of open data for analysis and the application of synthetic datasets. Future research could consider the statistical behavior of different business sectors, multiclass labeling for advanced recommendations, and the implementation of unsupervised models.
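To illustrate the kind of oversampling the summary refers to, the following is a minimal sketch of the SMOTE interpolation idea in plain Python. It is not the authors' implementation: the function name `smote_sketch`, its parameters, and the toy rows are all hypothetical, and the paper's actual features, neighbour count, and tooling are not stated in this record.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority-class
    row and one of its k nearest minority-class neighbours (SMOTE idea).
    Hypothetical helper for illustration; requires at least two rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row, excluding the row itself
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda row: math.dist(base, row))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Made-up toy data: two invented financial ratios per company,
# all belonging to the minority class. Not taken from the paper.
minority_rows = [[0.10, 1.2], [0.15, 1.0], [0.12, 1.4], [0.20, 0.9]]
new_rows = smote_sketch(minority_rows, n_new=5)
```

Because each synthetic row is a convex combination of two real minority rows, every generated value stays within the range already observed for that feature. In practice a maintained library implementation would likely be preferable to a hand-written sketch like this one.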