SIMPLIFICATION OF CORRESPONDENCE ANALYSIS TO INCREASE THE PRINCIPAL COORDINATES ACCURACY
Main Author: | |
---|---|
Format: | Dissertations |
Language: | Indonesia |
Subjects: | |
Online Access: | https://digilib.itb.ac.id/gdl/view/32727 |
Institution: | Institut Teknologi Bandung |
Summary: | Correspondence analysis produces a map that shows the interaction between two categorical variables, based on the dependence among the row categories, the column categories, or both. The open problems of conventional correspondence analysis are that the eigenvalues are obtained by a numerical process, the calculation has many phases, the data are divided into several files, and the data are often incomplete and inaccurate. The simplification of correspondence analysis calculates the principal coordinates directly from the frequency data in the cross-tabulation matrix, so it avoids the numerical process and minimizes rounding. A further simplification is that a calculation on part of the data gives results relatively similar to those from the complete data.
The mathematical problems in correspondence analysis that are solved in this study are: 1) Determining principal coordinates that are simpler and more accurate than those from the conventional methods. 2) Determining representative principal coordinates based on samples from large-dimensional data. 3) Conducting a case study based on inter-city bus passenger data. To address these problems, several new methods were obtained: 1) A method to determine more accurate principal coordinates of an $a \times J$ or $I \times a$ contingency table, with $a = 2, 3$ and $I, J \ge 2$, called SoCA (simplification of correspondence analysis). 2) A method to obtain the sample size from the complete data using numerical random sampling (NRS). 3) A method to obtain the correction matrix used to modify the sample correspondence matrix so that it is relatively similar to that of the complete data.
The eigenvalue analysis yields a theorem ensuring that 0 is an eigenvalue in the correspondence analysis calculation, so the eigenvalues for a $2 \times J$, $I \times 2$, $3 \times J$, or $I \times 3$ contingency table can be calculated analytically. If $N$ is a $2 \times J$ or $I \times 2$ contingency table constructed from two categorical variables, determining the principal coordinates with SoCA is simpler and the results are more accurate, because they are computed directly from the elements of $N$. Let $S$ be the $I \times J$ standardized residual matrix, which represents the association between categories; the singular values and left singular vectors of $S$ are derived from the eigenvalues and eigenvectors of $A = S S^{T}$ for $I \le J$ or $A = S^{T} S$ for $I > J$. If $N$ is a $3 \times J$ or $I \times 3$ contingency table, then $A$ can be computed directly from the elements of $N$. For a $3 \times J$ or $I \times 3$ contingency table, the principal coordinates from SoCA are likewise simpler to obtain and more accurate, because they are calculated directly from the elements of $N$ rather than by a numerical process.
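The role of the zero eigenvalue can be illustrated for a two-row table. The sketch below is not the dissertation's SoCA formula; it is a minimal illustration that relies only on standard correspondence analysis facts. For a 2 × J table, the 2 × 2 matrix built from the standardized residuals has 0 as one eigenvalue, so the other eigenvalue equals its trace (the total inertia) and the row principal coordinates follow in closed form from the table entries. The function name `row_principal_coords_2xj` and the example counts are hypothetical, and the sign of the coordinates is arbitrary, as in any singular value decomposition.

```python
import math


def row_principal_coords_2xj(table):
    """Closed-form row principal coordinates for a 2 x J contingency table.

    Illustrative sketch only (hypothetical helper, not the SoCA method itself).
    Assumes every row total and column total is positive.
    """
    if len(table) != 2:
        raise ValueError("expected a table with exactly two rows")

    n = sum(sum(row) for row in table)
    n_cols = len(table[0])

    # Correspondence matrix and its row and column masses.
    p = [[x / n for x in row] for row in table]
    r = [sum(row) for row in p]
    c = [p[0][j] + p[1][j] for j in range(n_cols)]

    # Total inertia: sum of squared standardized residuals.
    # Because 0 is an eigenvalue of the 2 x 2 matrix of residual
    # cross-products, the total inertia is its only nonzero eigenvalue.
    inertia = sum(
        (p[i][j] - r[i] * c[j]) ** 2 / (r[i] * c[j])
        for i in range(2)
        for j in range(n_cols)
    )
    sigma = math.sqrt(inertia)

    # The eigenvector for eigenvalue 0 is (sqrt(r1), sqrt(r2)), so the other
    # unit eigenvector is (sqrt(r2), -sqrt(r1)). Dividing by sqrt(r_i) and
    # scaling by the singular value gives the principal coordinates.
    f1 = math.sqrt(r[1] / r[0]) * sigma
    f2 = -math.sqrt(r[0] / r[1]) * sigma
    return inertia, (f1, f2)


if __name__ == "__main__":
    # Hypothetical counts, chosen only for illustration.
    inertia, coords = row_principal_coords_2xj([[10, 20, 30], [30, 20, 10]])
    print(inertia, coords)
```

No iterative eigenvalue routine is called here, which is the kind of saving the paragraph above describes; the same idea applied to three rows would require solving a quadratic rather than reading off a trace.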
The need for large data storage, long processing times, and costly data recovery processes leads researchers to analyze data from samples. The conventional sampling technique for large data is simple random sampling (SRS), which requires distributional assumptions, whereas this dissertation introduces an assumption-free sampling technique called NRS. NRS results are consistent even when the properties of the estimated matrix (marginal distribution, dependence, and size) differ. Researchers often use SRS with a subjectively chosen margin of error, in which case the estimate will most likely be biased, so NRS is recommended instead. Besides estimating a matrix, NRS can also be used to estimate vectors and scalars from the complete data. This flexibility helps the calculation in any type of statistical analysis.
The dissertation also adds a method to modify the sample correspondence matrix $P_s$ so that the results are similar to those from the complete data. Let $P$ be the complete-data correspondence matrix, $\hat{P}$ the estimator of $P$, and $C$ the correction matrix. It is shown that $\hat{P} \approx P$, with $\hat{P} = C P_s$ if $I \le J$ and $\hat{P} = P_s C$ if $I > J$. Each element of the correction matrix $C$ is an estimate taken from the distribution of the corresponding element over the correction matrices of $k$ sample sets, that is, $C_i$, $i = 1, \dots, k$. For each sample set, $C_i$ is calculated from $P$ and an inverse involving the sample correspondence matrix $P_{s,i}$, with one expression for $I \le J$ and another for $I > J$.
Based on the research and the case study on Bandung-Cirebon inter-city bus passenger data, recommendations were obtained for the management of the inter-city bus company. 76.7% of passengers came from the cities of Bandung and Cirebon. Passengers from Bandung are associated with private employees, and passengers from Cirebon with students. Recommendations for company management are: 1) Provide conveniences for private employees from Bandung, for example easier access to tickets and feeder services, or long-term parking space at the bus pool, although this requires additional cost. 2) Provide appropriate services for students from Cirebon, for example discounted ticket prices for students. |
---|