Defending large language models against jailbreak attacks via layer-specific editing

Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Despite their impressive performance, recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts even when aligned via Reinforcement Learning from Human Feedback or supervised fine-tuning. While existing defense methods focus on either detecting harmful prompts or reducing the likelihood of harmful responses through various means, defending LLMs against jailbreak attacks based on the inner mechanisms of LLMs remains largely unexplored. In this work, we investigate how LLMs respond to harmful prompts and propose a novel defense method termed Layer-specific Editing (LED) to enhance the resilience of LLMs against jailbreak attacks. Through LED, we reveal that several critical safety layers exist among the early layers of LLMs. We then show that realigning these safety layers (and some selected additional layers) with the decoded safe response from identified toxic layers can significantly improve the alignment of LLMs against jailbreak attacks. Extensive experiments across various LLMs (e.g., Llama2, Mistral) demonstrate that LED effectively defends against jailbreak attacks while maintaining performance on benign prompts. Our code is available at https://github.com/ledllm/ledllm.
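The abstract describes inspecting how individual transformer layers react to harmful prompts and decoding intermediate layers into tokens. As a rough illustration of that idea only (this is not the authors' implementation; their official LED code is at https://github.com/ledllm/ledllm), the sketch below decodes each layer's hidden state through the model's unembedding head, a common "logit lens"-style probe. The model name and probe prompt are placeholders, and the attribute paths assume a Llama/Mistral-family model from Hugging Face Transformers.

```python
# Minimal "logit lens"-style per-layer probe (illustrative sketch only;
# NOT the LED implementation from https://github.com/ledllm/ledllm).
# Assumptions: a Llama/Mistral-family causal LM whose final norm and
# unembedding are exposed as model.model.norm and model.lm_head.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; any Llama/Mistral-style LM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

prompt = "How do I pick a lock?"  # toy probe prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states is a tuple of (num_layers + 1) tensors of shape
# [batch, seq_len, hidden_size]: the embedding output plus every decoder layer.
for layer_idx, hidden in enumerate(outputs.hidden_states):
    last_token = model.model.norm(hidden[:, -1, :])  # apply the model's final norm
    logits = model.lm_head(last_token)               # project into vocabulary space
    top_token = tokenizer.decode(logits.argmax(dim=-1))
    print(f"layer {layer_idx:2d} -> {top_token!r}")
```

Under this kind of probe, layers whose decoded continuation flips between refusal-style tokens and compliant text when the prompt is wrapped in a jailbreak would be candidates for the safety layers that LED then edits; the paper's actual layer-selection and editing procedure is described in the linked PDF.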

Bibliographic Details
Main Authors: ZHAO, Wei; LI, Zhe; LI, Yige; SUN, Jun
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2024
Subjects: Software Engineering
DOI: 10.48550/arXiv.2405.18166
License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Collection: Research Collection School Of Computing and Information Systems (InK@SMU)
Online Access: https://ink.library.smu.edu.sg/sis_research/9832
https://ink.library.smu.edu.sg/context/sis_research/article/10832/viewcontent/2024.findings_emnlp.293.pdf
Institution: Singapore Management University