A new framework in improving prediction of class imbalance for student performance in Oman educational dataset using clustering based sampling techniques

According to the Oman Education Portal (OEP), class imbalance is common in student performance data sets. Most students perform well, while only a small number underperform. Classification techniques applied to an imbalanced dataset can yield deceptively high prediction accuracy. T...
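The point about misleading accuracy can be illustrated with a small worked example. The sketch below is not taken from the thesis: the 95/5 class split and both sets of predictions are hypothetical, chosen only to show how accuracy and the Matthews correlation coefficient (MCC) can disagree on imbalanced data.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn), treating `positive` as the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical labels: 95 well-performing students (0), 5 underperforming (1).
y_true = [0] * 95 + [1] * 5

# Classifier A always predicts the majority class.
pred_a = [0] * 100
# Classifier B flags 4 of the 5 minority cases, at the cost of 5 false alarms.
pred_b = [1] * 5 + [0] * 90 + [1] * 4 + [0] * 1

counts_a = confusion_counts(y_true, pred_a)
counts_b = confusion_counts(y_true, pred_b)

print("A: accuracy", accuracy(*counts_a), "MCC", mcc(*counts_a))
print("B: accuracy", accuracy(*counts_b), "MCC", round(mcc(*counts_b), 3))
```

Classifier A scores higher on accuracy although it never identifies an underperforming student, while MCC ranks classifier B clearly higher.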

Bibliographic Details
Main Author:	Sultan Alalawi, Sultan Juma
Format:	Thesis
Language:	English
Published:	2021
Subjects:	L Education (General); QA273-280 Probabilities. Mathematical statistics
Online Access:	https://etd.uum.edu.my/10170/1/s902668_01.pdf https://etd.uum.edu.my/10170/2/s902668_02.pdf https://etd.uum.edu.my/10170/
Institution:	Universiti Utara Malaysia

id	my.uum.etd.10170
record_format	eprints
spelling	my.uum.etd.10170 2022-12-19T10:01:24Z https://etd.uum.edu.my/10170/ A new framework in improving prediction of class imbalance for student performance in Oman educational dataset using clustering based sampling techniques Sultan Alalawi, Sultan Juma L Education (General) QA273-280 Probabilities. Mathematical statistics 2021 Thesis NonPeerReviewed text en https://etd.uum.edu.my/10170/1/s902668_01.pdf text en https://etd.uum.edu.my/10170/2/s902668_02.pdf Sultan Alalawi, Sultan Juma (2021) A new framework in improving prediction of class imbalance for student performance in Oman educational dataset using clustering based sampling techniques. Doctoral thesis, Universiti Utara Malaysia.
institution	Universiti Utara Malaysia
building	UUM Library
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Utara Malaysia
content_source	UUM Electronic Theses
url_provider	http://etd.uum.edu.my/
language	English
topic	L Education (General); QA273-280 Probabilities. Mathematical statistics
description	According to the Oman Education Portal (OEP), class imbalance is common in student performance data sets. Most students perform well, while only a small number underperform. Classification techniques applied to an imbalanced dataset can yield deceptively high prediction accuracy: the majority class usually drives the overall predictive accuracy at the expense of abysmal performance on the minority class. The main objective of this study was to predict student performance from data with an imbalanced class distribution by exploiting different sampling techniques and several data mining classifier models. Three main sampling techniques - synthetic minority over-sampling technique (SMOTE), random under-sampling (RUS), and clustering-based sampling - were compared with the aim of improving predictive accuracy in the minority class while maintaining satisfactory overall classification performance. Five data mining classifiers - J48, Random Forest, K-Nearest Neighbour, Naïve Bayes, and Logistic Regression - were used to predict student performance. 10-fold cross-validation was used to minimize sampling bias. Classifier performance was evaluated using four metrics: accuracy, false positives (FP), Matthews correlation coefficient (MCC), and Receiver Operating Characteristic (ROC). OEP datasets from 2018 and 2019 were extracted to assess the efficacy of both the sampling techniques and the classification methods. The results indicated that K-Nearest Neighbour combined with the clustering-based sampling technique produced the best classification performance, with an MCC value of 98.4% under 10-fold cross-validation. The clustering-based sampling techniques improved the overall prediction performance for the minority class. In addition, the variables most important for accurately predicting student performance were identified using the Random Forest model. OEP contains a large amount of data, and analyses based on this large and complex data can help OEP stakeholders improve student performance and identify students who require additional attention.
format	Thesis
author	Sultan Alalawi, Sultan Juma
title	A new framework in improving prediction of class imbalance for student performance in Oman educational dataset using clustering based sampling techniques
publishDate	2021
url	https://etd.uum.edu.my/10170/1/s902668_01.pdf https://etd.uum.edu.my/10170/2/s902668_02.pdf https://etd.uum.edu.my/10170/
_version_	1753791153797332992
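
The record does not say how the thesis implements clustering-based sampling. As a rough sketch of one common variant, assumed here purely for illustration, the majority class can be clustered with k-means and reduced to the real sample nearest each centroid. The function names, the 2-D synthetic points, and the class sizes below are all hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means; returns k centroids. Empty clusters keep their old centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


def cluster_undersample(majority, n_keep, seed=0):
    """Keep n_keep majority samples: the real point closest to each centroid."""
    centroids = kmeans(majority, n_keep, seed=seed)
    remaining = list(majority)
    kept = []
    for c in centroids:
        best = min(remaining, key=lambda p: squared_distance(p, c))
        kept.append(best)
        remaining.remove(best)  # never pick the same sample twice
    return kept


# Hypothetical data: 60 majority points (label 0) and 6 minority points (label 1).
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(6)]

kept = cluster_undersample(majority, n_keep=len(minority))
balanced = [(p, 0) for p in kept] + [(p, 1) for p in minority]
print(len(kept), len(minority))  # prints "6 6"
```

When a resampling step like this is combined with 10-fold cross-validation, it is usually applied to the training folds only, so that each held-out fold keeps the original class distribution.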

Similar Items