An empirical study of pattern leakage impact during data preprocessing on machine learning-based intrusion detection models reliability

In this paper, we investigate the impact of pattern leakage during data preprocessing on the reliability of Machine Learning (ML) based intrusion detection systems (IDS). Data leakage, also known as pattern leakage, occurs during data preprocessing when information from the testing set is used in training, leading to overfitting and inflated accuracy scores. Our study uses three well-known intrusion detection datasets: NSL-KDD, UNSW-NB15, and KDDCUP99. We preprocess the data to create versions with and without pattern leakage, then train and test six ML models: Decision Tree (DT), Gradient Boosting (GB), K-Nearest Neighbours (KNN), Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR). Our results show that building IDS models with data leakage yields higher accuracy scores but unreliable models. Additionally, we find that some algorithms are more sensitive to data leakage than others, as shown by the drop in accuracy when the models are built without leakage. To address this problem, we provide suggestions for mitigating data leakage in the training process and for analyzing the sensitivity of different algorithms. Overall, our study emphasizes the importance of addressing data leakage in the training process to ensure the reliability of ML-based IDS models.
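The preprocessing pitfall the abstract describes can be illustrated with a short sketch: fitting a scaler on the full dataset before splitting lets test-set statistics leak into the training data, whereas fitting it on the training split alone does not. The synthetic data, StandardScaler, and Decision Tree below are illustrative assumptions added for this record, not the authors' exact experimental pipeline.

# Minimal sketch of preprocessing with and without pattern leakage.
# Synthetic data stands in for NSL-KDD / UNSW-NB15 / KDDCUP99; the scaler
# and classifier choices are assumptions, not the paper's exact setup.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Leaky preprocessing: the scaler is fit on ALL rows before the split,
# so statistics from the (future) test set influence the training data.
X_all_scaled = StandardScaler().fit_transform(X)
Xl_tr, Xl_te, yl_tr, yl_te = train_test_split(
    X_all_scaled, y, test_size=0.3, random_state=0)

# Leak-free preprocessing: split first, fit the scaler on the training
# split only, then apply the same fitted transform to the test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_tr)
Xs_tr, Xs_te = scaler.transform(X_tr), scaler.transform(X_te)

for name, (tr_X, tr_y, te_X, te_y) in {
    "with leakage": (Xl_tr, yl_tr, Xl_te, yl_te),
    "without leakage": (Xs_tr, y_tr, Xs_te, y_te),
}.items():
    model = DecisionTreeClassifier(random_state=0).fit(tr_X, tr_y)
    print(name, round(accuracy_score(te_y, model.predict(te_X)), 4))

On this toy data the two scores will be nearly identical; the point is the procedural difference between the two pipelines. The accuracy gaps reported in the paper come from preprocessing the real IDS datasets listed above.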


Bibliographic Details
Main Authors: Bouke, Mohamed Aly, Abdullah, Azizol
Format: Article
Published: Elsevier B.V. 2023
Online Access:http://psasir.upm.edu.my/id/eprint/106552/
https://linkinghub.elsevier.com/retrieve/pii/S0957417423012174
Institution: Universiti Putra Malaysia
id my.upm.eprints.106552
record_format eprints
spelling my.upm.eprints.1065522024-10-03T04:25:37Z http://psasir.upm.edu.my/id/eprint/106552/ An empirical study of pattern leakage impact during data preprocessing on machine learning-based intrusion detection models reliability Bouke, Mohamed Aly Abdullah, Azizol In this paper, we investigate the impact of pattern leakage during data preprocessing on the reliability of Machine Learning (ML) based intrusion detection systems (IDS). Data leakage, also known as pattern leakage, occurs during data preprocessing when information from the testing set is used in training, leading to overfitting and inflated accuracy scores. Our study uses three well-known intrusion detection datasets: NSL-KDD, UNSW-NB15, and KDDCUP99. We preprocess the data to create versions with and without pattern leakage, then train and test six ML models: Decision Tree (DT), Gradient Boosting (GB), K-Nearest Neighbours (KNN), Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR). Our results show that building IDS models with data leakage yields higher accuracy scores but unreliable models. Additionally, we find that some algorithms are more sensitive to data leakage than others, as shown by the drop in accuracy when the models are built without leakage. To address this problem, we provide suggestions for mitigating data leakage in the training process and for analyzing the sensitivity of different algorithms. Overall, our study emphasizes the importance of addressing data leakage in the training process to ensure the reliability of ML-based IDS models. Elsevier B.V. 2023 Article PeerReviewed Bouke, Mohamed Aly and Abdullah, Azizol (2023) An empirical study of pattern leakage impact during data preprocessing on machine learning-based intrusion detection models reliability. Expert Systems with Applications, 230. pp. 1-9. ISSN 0957-4174; ESSN: 1873-6793 https://linkinghub.elsevier.com/retrieve/pii/S0957417423012174 10.1016/j.eswa.2023.120715
institution Universiti Putra Malaysia
building UPM Library
collection Institutional Repository
continent Asia
country Malaysia
content_provider Universiti Putra Malaysia
content_source UPM Institutional Repository
url_provider http://psasir.upm.edu.my/
description In this paper, we investigate the impact of pattern leakage during data preprocessing on the reliability of Machine Learning (ML) based intrusion detection systems (IDS). Data leakage, also known as pattern leakage, occurs during data preprocessing when information from the testing set is used in training, leading to overfitting and inflated accuracy scores. Our study uses three well-known intrusion detection datasets: NSL-KDD, UNSW-NB15, and KDDCUP99. We preprocess the data to create versions with and without pattern leakage, then train and test six ML models: Decision Tree (DT), Gradient Boosting (GB), K-Nearest Neighbours (KNN), Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR). Our results show that building IDS models with data leakage yields higher accuracy scores but unreliable models. Additionally, we find that some algorithms are more sensitive to data leakage than others, as shown by the drop in accuracy when the models are built without leakage. To address this problem, we provide suggestions for mitigating data leakage in the training process and for analyzing the sensitivity of different algorithms. Overall, our study emphasizes the importance of addressing data leakage in the training process to ensure the reliability of ML-based IDS models.
format Article
author Bouke, Mohamed Aly
Abdullah, Azizol
spellingShingle Bouke, Mohamed Aly
Abdullah, Azizol
An empirical study of pattern leakage impact during data preprocessing on machine learning-based intrusion detection models reliability
author_facet Bouke, Mohamed Aly
Abdullah, Azizol
author_sort Bouke, Mohamed Aly
title An empirical study of pattern leakage impact during data preprocessing on machine learning-based intrusion detection models reliability
title_short An empirical study of pattern leakage impact during data preprocessing on machine learning-based intrusion detection models reliability
title_full An empirical study of pattern leakage impact during data preprocessing on machine learning-based intrusion detection models reliability
title_fullStr An empirical study of pattern leakage impact during data preprocessing on machine learning-based intrusion detection models reliability
title_full_unstemmed An empirical study of pattern leakage impact during data preprocessing on machine learning-based intrusion detection models reliability
title_sort empirical study of pattern leakage impact during data preprocessing on machine learning-based intrusion detection models reliability
publisher Elsevier B.V.
publishDate 2023
url http://psasir.upm.edu.my/id/eprint/106552/
https://linkinghub.elsevier.com/retrieve/pii/S0957417423012174
_version_ 1814054606542471168