Investigation on effective solutions against insider attacks

Bibliographic Details
Main Author: Ang, Jun Hao
Other Authors: Felicity Chan
Format: Final Year Project
Language: English
Published: 2018
Subjects:
Online Access:http://hdl.handle.net/10356/74243
Institution: Nanyang Technological University
Description
Summary: One of the common flaws of current insider threat detection is its high demand for data storage. This report investigates the effectiveness of dimensionality reduction techniques in reducing the storage demand of the machine learning methods used for insider threat detection. The dimensionality reduction techniques discussed in this report are feature selection methods, namely Recursive Feature Elimination (RFE) and the Chi-Square Test, and feature extraction methods, namely Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). The machine learning algorithms discussed are a supervised method, K-Nearest Neighbour (KNN), and an unsupervised method, K-Means Clustering (KMC). The dataset used is a labelled phishing website dataset with 10,000 rows and 30 features. In practice, the accuracy of an insider threat detection system matters more than its data storage demand, although improving accuracy while also reducing storage is a bonus. Therefore, in the experiments conducted for this report, the effectiveness of a dimensionality reduction technique is evaluated by the maximum amount of data storage it can save, regardless of the size of any accompanying improvement in accuracy. Under this evaluation, the experimental results show that the feature selection methods, RFE and the Chi-Square Test, generally performed well with both KNN and KMC, whereas among the feature extraction methods PCA performed well only with KNN and LDA performed exceptionally well only with KMC. From the results, it can be concluded that the performance of the feature selection methods is more stable than that of the feature extraction methods, but the improvements in accuracy and data storage reduction achieved by the feature extraction methods are far greater than those achieved by the feature selection methods. One recommendation for future projects is to evaluate the effectiveness of the aforementioned dimensionality reduction techniques, together with embedded feature selection methods and other feature extraction methods, on supervised, unsupervised and reinforcement learning.
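
To make the kind of comparison described above more concrete, the sketch below shows a minimal pipeline in Python with scikit-learn: a classifier run on the full feature set, after feature selection (RFE), and after feature extraction (PCA), judged on accuracy and on how many columns must be stored. The synthetic data, the pairing of RFE and PCA with KNN, and the reduction to 10 features are illustrative assumptions and are not taken from the report itself.

```python
# Hypothetical sketch (not from the report): dimensionality reduction ahead of
# a KNN classifier, loosely mirroring the described dataset of 10,000 rows and
# 30 features. The synthetic data and the choice of 10 retained columns are
# assumptions for illustration only.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=10_000, n_features=30, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipelines = {
    "KNN only": make_pipeline(KNeighborsClassifier()),
    # Feature selection: keep 10 of the 30 original columns. RFE needs an
    # estimator exposing coef_ or feature_importances_, so a logistic
    # regression is used here purely to rank the features.
    "RFE + KNN": make_pipeline(
        RFE(LogisticRegression(max_iter=1000), n_features_to_select=10),
        KNeighborsClassifier()),
    # Feature extraction: project the 30 columns onto 10 principal components.
    "PCA + KNN": make_pipeline(PCA(n_components=10), KNeighborsClassifier()),
}

for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    acc = pipe.score(X_test, y_test)
    # Columns stored per row after the reduction step: a rough proxy for the
    # data storage demand discussed in the abstract.
    n_cols = (pipe[:-1].transform(X_test).shape[1] if len(pipe) > 1
              else X_test.shape[1])
    print(f"{name}: accuracy={acc:.3f}, stored columns={n_cols}")
```

The same skeleton would accept the Chi-Square Test (SelectKBest with chi2, on non-negative features) or LDA in place of RFE or PCA, and K-Means in place of KNN, which is the grid of combinations the report evaluates.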