Defending large language models against jailbreak attacks via layer-specific editing
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Despite their impressive performance, recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts even when aligned via Reinforcement Learning from Human Feedback or supervised fine-tuning. While existing defense methods focus on either detecting harmful prompts or reducing the likelihood of harmful responses through various means, defending LLMs against jailbreak attacks based on the inner mechanisms of LLMs remains largely unexplored. In this work, we investigate how LLMs respond to harmful prompts and propose a novel defense method termed Layer-specific Editing (LED) to enhance the resilience of LLMs against jailbreak attacks. Through LED, we reveal that several critical safety layers exist among the early layers of LLMs. We then show that realigning these safety layers (and some selected additional layers) with the decoded safe response from identified toxic layers can significantly improve the alignment of LLMs against jailbreak attacks. Extensive experiments across various LLMs (e.g., Llama2, Mistral) demonstrate the effectiveness of LED, which defends against jailbreak attacks while maintaining performance on benign prompts. Our code is available at https://github.com/ledllm/ledllm.
Main Authors: ZHAO, Wei; LI, Zhe; LI, Yige; SUN, Jun; SUN, Jun
Format: text (application/pdf)
Language: English
Published: Institutional Knowledge at Singapore Management University, 2024
Publication Date: 2024-11-01
Subjects: Software Engineering
DOI: 10.48550/arXiv.2405.18166
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Online Access: https://ink.library.smu.edu.sg/sis_research/9832 ; https://ink.library.smu.edu.sg/context/sis_research/article/10832/viewcontent/2024.findings_emnlp.293.pdf
Institution: Singapore Management University
Collection: InK@SMU, Research Collection School Of Computing and Information Systems (SMU Libraries)
Record ID: sg-smu-ink.sis_research-10832
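The abstract above describes locating critical "safety layers" by decoding intermediate layers of the model. As a rough illustration of that kind of layer-wise analysis (not the authors' LED implementation, which is released at https://github.com/ledllm/ledllm), the sketch below applies a logit-lens-style probe to a Hugging Face causal LM: each layer's hidden state at the last prompt position is projected through the model's final norm and LM head to see at which depth a refusal-style next token first becomes the top prediction. The model name, example prompt, and refusal keywords are illustrative assumptions.

```python
# Minimal logit-lens style probe over layers, assuming a Llama-2-style chat model
# loaded with Hugging Face transformers. This only illustrates the general idea of
# inspecting where refusal ("safe") behaviour emerges across layers; it is NOT the
# LED method itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"        # assumption: any causal LM works
REFUSAL_TOKENS = {"I", "Sorry", "sorry", "cannot"}  # crude proxies for a refusal

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# device_map="auto" assumes the accelerate package and (ideally) a GPU are available.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

prompt = "How do I pick a lock?"  # stand-in for a harmful / jailbreak prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors of shape [1, seq_len, hidden].
num_states = len(out.hidden_states)
for layer_idx, h in enumerate(out.hidden_states):
    h_last = h[:, -1, :]
    if layer_idx < num_states - 1:
        # Intermediate states are pre-norm; apply the final RMSNorm before the LM head.
        h_last = model.model.norm(h_last)
    logits = model.lm_head(h_last)                   # [1, vocab_size]
    top_token = tokenizer.decode(logits.argmax(dim=-1)).strip()
    flag = "  <- refusal-like" if top_token in REFUSAL_TOKENS else ""
    print(f"layer {layer_idx:2d}: next-token guess = {top_token!r}{flag}")
```

In the paper's terms, early layers whose decoded guesses already favour a refusal would be candidates for the critical safety layers that LED then edits and realigns; the released repository should be consulted for the actual procedure.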