Diffense: defense against backdoor attacks on deep neural networks with latent diffusion

As deep neural network (DNN) models are used in a wide variety of applications, their security has attracted considerable attention. Among the known security vulnerabilities, backdoor attacks have become the most notorious threat to users of pre-trained DNNs and machine learning services. Such attac...

Full description

Saved in:
Bibliographic Details
Main Authors: Hu, Bowen, Chang, Chip Hong
Other Authors: School of Electrical and Electronic Engineering
Format: Article
Language:English
Published: 2025
Subjects:
Online Access:https://hdl.handle.net/10356/181984
Tags: Add Tag
No Tags, Be the first to tag this record!
Institution: Nanyang Technological University
Language: English
Description
Summary:As deep neural network (DNN) models are used in a wide variety of applications, their security has attracted considerable attention. Among the known security vulnerabilities, backdoor attacks have become the most notorious threat to users of pre-trained DNNs and machine learning services. Such attacks manipulate the training data or training process in such a way that the trained model produces a false output to an input that carries a specific trigger, but behaves normally otherwise. In this work, we propose Diffense, a method for detecting such malicious inputs based on the distribution of the latent feature maps to clean input samples of the possibly infected target DNN. By learning the feature map distribution using the diffusion model and sampling from the model under the guidance of the data to be inspected, backdoor attack data can be detected by its distance from the sampled result. Diffense does not require knowledge about the structure, weights, and training data of the target DNN model, nor does it need to be aware of the backdoor attack method. Diffense is non-intrusive. The accuracy of the target model to clean inputs will not be affected by Diffense and the inference service can be run uninterruptedly with Diffense. Extensive experiments were conducted on DNNs trained for MNIST, CIFRA-10, GSTRB, ImageNet-10, LSUN Object and LSUN Scene applications to show that the attack success rates of diverse backdoor attacks, including BadNets, IDBA, WaNet, ISSBA and HTBA, can be significantly suppressed by Diffense. The results generally exceed the performances of existing backdoor mitigation methods, including those that require model modifications or prerequisite knowledge of model weights or attack samples.