Protecting neural networks from adversarial attacks

Bibliographic Details
Main Author: Tan, Bryan Bing Xing
Other Authors: Anupam Chattopadhyay
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2020
Subjects:
Online Access:https://hdl.handle.net/10356/137938
Institution: Nanyang Technological University
Description
Summary: As modern technology rapidly progresses, more applications are applying machine learning, especially deep learning, to time-critical and real-world tasks. Adversaries are devising new ways to exploit attack surfaces in the machine learning pipeline, rendering systems and applications ineffective. Using carefully crafted adversarial examples, an adversary can cause even a well-trained model to fail. Current defence implementations do not cover all attack surfaces, so the concepts behind adversarial attacks must be understood more deeply before a strategic defence mechanism can be developed. In this project, a theoretical framework is proposed to investigate the concepts behind adversarial examples and to develop a defence strategy against such attacks for image recognition applications. Before this defence strategy is formulated, a study is conducted to understand how adversarial attacks work: how adversarial perturbations are generated for a given image, and in which specific regions of the image these perturbations form. A further study examines what a classifier looks at when classifying an image. From the information gathered, several methods are applied to locate these regions and present them as attention maps. Next, an attempt is made to find a spatial correlation between these regions: whether any of them overlap and, if so, whether the area susceptible to adversarial perturbations can be minimized or removed. There are two key experiments in this project. The first analyzes classification performance before and after an attack using the Fast Gradient Sign Method (FGSM). The second uses the trained model to generate all the elements of the proposed theoretical framework. The results are discussed and analyzed with the aim of developing a defence against adversarial attacks.
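
The summary names two gradient-based ingredients: FGSM adversarial perturbations and attention (saliency) maps showing which pixels the classifier responds to. The sketch below is not taken from the project itself; it is a minimal illustration of both ideas, assuming a PyTorch image classifier, an input tensor scaled to [0, 1], and an illustrative epsilon of 0.03.

import torch
import torch.nn.functional as F

def fgsm_and_saliency(model, image, label, epsilon=0.03):
    """Return an FGSM-perturbed image and a gradient-magnitude saliency map.

    image : (1, C, H, W) float tensor with values in [0, 1]
    label : (1,) long tensor holding the ground-truth class index
    """
    model.eval()
    image = image.clone().detach().requires_grad_(True)

    # Forward pass and loss with respect to the true label.
    logits = model(image)
    loss = F.cross_entropy(logits, label)

    # Backward pass gives the gradient of the loss w.r.t. the input pixels.
    model.zero_grad()
    loss.backward()
    grad = image.grad.detach()

    # FGSM: step in the direction of the sign of the gradient, then clip
    # back into the valid pixel range.
    adv_image = (image + epsilon * grad.sign()).clamp(0.0, 1.0).detach()

    # Saliency ("attention") map: per-pixel gradient magnitude, reduced
    # over colour channels, indicating where small changes move the loss most.
    saliency = grad.abs().max(dim=1).values  # shape (1, H, W)

    return adv_image, saliency

Comparing the model's prediction on adv_image with its prediction on the clean image reproduces, in miniature, the kind of pre- and post-attack classification comparison described in the first experiment.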