Defending large language models against jailbreak attacks via layer-specific editing

Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Despite their impressive performance, recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts even when aligned via Reinforcement Learning from Human Feedback or supervised fine-tuning. While existing defense methods focus on either detecting harmful prompts or reducing the likelihood of harmful responses through various means, defending LLMs against jailbreak attacks based on the inner mechanisms of LLMs remains largely unexplored. In this work, we investigate how LLMs respond to harmful prompts and propose a novel defense method termed Layer-specific Editing (LED) to enhance the resilience of LLMs against jailbreak attacks. Through LED, we reveal that several critical safety layers exist among the early layers of LLMs. We then show that realigning these safety layers (and some selected additional layers) with the decoded safe response from identified toxic layers can significantly improve the alignment of LLMs against jailbreak attacks. Extensive experiments across various LLMs (e.g., Llama2, Mistral) demonstrate that LED effectively defends against jailbreak attacks while maintaining performance on benign prompts. Our code is available at https://github.com/ledllm/ledllm.
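The abstract describes inspecting how individual transformer layers react to harmful prompts and decoding intermediate layers into tokens. As a rough illustration of that idea only (this is not the authors' implementation; their official LED code is at https://github.com/ledllm/ledllm), the sketch below decodes each layer's hidden state through the model's unembedding head, a common "logit lens"-style probe. The model name and probe prompt are placeholders, and the attribute paths assume a Llama/Mistral-family model from Hugging Face Transformers.

```python
# Minimal "logit lens"-style per-layer probe (illustrative sketch only;
# NOT the LED implementation from https://github.com/ledllm/ledllm).
# Assumptions: a Llama/Mistral-family causal LM whose final norm and
# unembedding are exposed as model.model.norm and model.lm_head.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; any Llama/Mistral-style LM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

prompt = "How do I pick a lock?"  # toy probe prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states is a tuple of (num_layers + 1) tensors of shape
# [batch, seq_len, hidden_size]: the embedding output plus every decoder layer.
for layer_idx, hidden in enumerate(outputs.hidden_states):
    last_token = model.model.norm(hidden[:, -1, :])  # apply the model's final norm
    logits = model.lm_head(last_token)               # project into vocabulary space
    top_token = tokenizer.decode(logits.argmax(dim=-1))
    print(f"layer {layer_idx:2d} -> {top_token!r}")
```

Under this kind of probe, layers whose decoded continuation flips between refusal-style tokens and compliant text when the prompt is wrapped in a jailbreak would be candidates for the safety layers that LED then edits; the paper's actual layer-selection and editing procedure is described in the linked PDF.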

Bibliographic Details
Main Authors: ZHAO, Wei; LI, Zhe; LI, Yige; SUN, Jun
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2024
Subjects: Software Engineering
DOI: 10.48550/arXiv.2405.18166
License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Collection: Research Collection School Of Computing and Information Systems (InK@SMU)
Online Access: https://ink.library.smu.edu.sg/sis_research/9832
https://ink.library.smu.edu.sg/context/sis_research/article/10832/viewcontent/2024.findings_emnlp.293.pdf
Institution: Singapore Management University