Synthetic multivariate data generation procedure with various outlier scenarios using R programming language
A synthetic data generation procedure is a procedure to generate data from either a statistical or mathematical model. The data generation procedure has been used in simulation studies to compare statistical performance methods or propose a new statistical method with a specific distribution. A synt...
Saved in:
Main Authors: | , , |
---|---|
Format: | Article |
Language: | English |
Published: |
Penerbit Universiti Teknologi Malaysia
2022
|
Subjects: | |
Online Access: | http://umpir.ump.edu.my/id/eprint/33887/1/2022%20Mutalib%20et%20al%20jurnal%20teknologi.pdf http://umpir.ump.edu.my/id/eprint/33887/ https://doi.org/10.11113/jurnalteknologi.v84.17900 https://doi.org/10.11113/jurnalteknologi.v84.17900 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Institution: | Universiti Malaysia Pahang |
Language: | English |
Summary: | A synthetic data generation procedure is a procedure to generate data from either a statistical or mathematical model. The data generation procedure has been used in simulation studies to compare statistical performance methods or propose a new statistical method with a specific distribution. A synthetic multivariate data generation procedure with various outlier scenarios using R is formulated in this study. An outlier generating model is used to generate multivariate data that contains outliers. Data generation procedures for various outlier scenarios
by using R are explained. Three outlier scenarios are produced, and graphical representations using 3D scatterplot and Chernoff faces for these outlier scenarios are shown. The graphical
representation shows that as the distance between outliers and inliers by shifting the mean, increases in Outlier Scenario 1, the outliers and inliers are completely separated. The same pattern can also be seen when the distance between outliers and inliers, by shifting the covariance, increase in Outlier Scenario 2. For Outlier Scenario 3, when both values and increase, the separation of outliers and inliers are more apparent. The data generation procedure in this study will be continually used in other applications, such as identifying outliers by using the
clustering method. |
---|