Machine learning techniques for advanced cyber attack detection

Bibliographic Details
Main Author: Yang, Wenzhuo
Other Authors: Lam Kwok Yan
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2022
Online Access: https://hdl.handle.net/10356/161429
Institution: Nanyang Technological University
Description
Summary: With the development of information and communication technologies (ICT), more and more data is generated, processed, and transmitted among smart components and organizations. ICT brings convenience and opportunities to people and society, but the resulting security-critical and privacy-sensitive data also attracts more attackers and increases the likelihood of security incidents. Studying effective and practical techniques against cyber attacks and maintaining cybersecurity in the big data society is therefore increasingly important for protecting the confidentiality, integrity, and availability of user data in cyberspace. Cybersecurity places strong emphasis on detection, reaction, and protection measures. As one of the key steps in defending against cyber attacks, cyber attack detection plays a critical role in security posture monitoring and threat warning. This thesis investigates suitable machine learning (ML) techniques for constructing advanced cyber attack detection systems. Specifically, we explore promising ML techniques for designing effective intrusion detection systems (IDS) and efficient cyber threat intelligence (CTI) analysis models that enable proactive defense against cyber attacks.
As one of the most important cybersecurity detection tools, an IDS identifies anomalous activities from internal system data, reducing the financial and reputational losses caused by cyber attacks. Many ML techniques have been used to automate intrusion detection. However, most existing ML-based IDSs face practical issues in real industrial settings, including the high cost of acquiring fully correctly labeled (FCL) data at big data scale and unsatisfactory detection accuracy for minority attacks in imbalanced data. We therefore explore training IDSs with weakly supervised learning (WSL) approaches that use weak labels (incomplete, inexact, or possibly inaccurate labels) to mitigate the data annotation burden and data privacy issues that traditional ML-based IDSs may face. WSL is a special ML paradigm, and weak labels are imperfect, high-level annotations that are easier and cheaper to obtain in practice than FCL data. First, we use a promising WSL technique, unlabeled-unlabeled learning (UUL), to train an IDS that distinguishes benign from malicious network traffic using data labeled with inexact information. Then, we investigate the feasibility of using another WSL archetype, partial label learning (PLL), to build an IDS from ambiguously labeled data. Several PLL techniques are leveraged, and various data resampling algorithms are combined with the proposed IDS model, to detect specific attacks and improve detection performance for minority attacks in imbalanced data.
CTI analysis is another promising method: it enables security experts to grasp emerging threat trends from external sources and gives targeted users early warnings so that they can take proactive countermeasures to detect and defend against potential cyber attacks. As cyber attacks grow increasingly sophisticated and menacing, sharing and analyzing CTI across security departments has become a global trend. The growing volume of CTI reports and frequent CTI sharing lead to data redundancy and create an urgent need for much more efficient analysis. A shortage of professional security analysts and the growing capability to capture network information in the big data society are two further challenges. To address these problems, we aim to speed up CTI analysis and automate CTI report classification through data mining and machine learning techniques. Hence, our third work presents a practical and efficient approach that gathers large quantities of CTI sources and then embeds and groups the CTI reports using unsupervised text representation algorithms jointly with six ML classifiers to automate the CTI analysis process.
In conclusion, we leverage ML techniques trained on imperfect labels for internal network intrusion detection and use generic feature representation tools jointly with different ML classifiers for external CTI data analysis to enhance cybersecurity. Extensive experiments show the feasibility and effectiveness of the proposed methods for advanced cyber attack detection.