Attack prompt generation for red teaming and defending large language models
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content. Previous research constructs attack prompts via manual or automatic methods, which have their own limitations on construction cost and quality. To address these issues, we propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts. Specifically, considering the impressive capabilities of newly emerged LLMs, we propose an attack framework to instruct LLMs to mimic human-generated prompts through in-context learning. Furthermore, we propose a defense framework that fine-tunes victim LLMs through iterative interactions with the attack framework to enhance their safety against red teaming attacks. Extensive experiments on different LLMs validate the effectiveness of our proposed attack and defense frameworks. Additionally, we release a series of attack prompts datasets named SAP with varying sizes, facilitating the safety evaluation and enhancement of more LLMs.
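The in-context-learning step the abstract describes can be pictured, very roughly, as assembling a few-shot prompt from a small pool of human-written example prompts so that an LLM imitates their style. The sketch below is purely illustrative and hypothetical: it is not the authors' released framework, and the function name, parameters, and placeholder examples are all assumptions.

```python
# Hypothetical sketch of few-shot (in-context learning) prompt assembly.
# Not the paper's actual code; placeholder examples stand in for the
# human-written prompts the framework would draw on.

def build_icl_prompt(examples, task_instruction, num_shots=3):
    """Embed up to `num_shots` examples into an instruction so a model
    can generate a new item in the same style."""
    shots = examples[:num_shots]
    lines = [task_instruction, ""]
    for i, example in enumerate(shots, start=1):
        lines.append(f"Example {i}: {example}")
    lines.append("Now write a new example in the same style:")
    return "\n".join(lines)

# Benign demonstration with placeholder content.
prompt = build_icl_prompt(
    examples=["<human-written prompt A>", "<human-written prompt B>"],
    task_instruction="Imitate the style of the examples below.",
)
print(prompt)
```

In the paper's setting, the generated candidates would then be filtered for quality and fed back into the example pool, and the defense side would fine-tune the victim model against them iteratively; none of that loop is shown here.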
Main Authors: DENG, Boyi; WANG, Wenjie; FENG, Fuli; DENG, Yang; WANG, Qifan; HE, Xiangnan
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2023
Subjects: Programming Languages and Compilers
Online Access: https://ink.library.smu.edu.sg/sis_research/9118
https://ink.library.smu.edu.sg/context/sis_research/article/10121/viewcontent/2023.findings_emnlp.143.pdf
Institution: Singapore Management University
id |
sg-smu-ink.sis_research-10121 |
record_format |
dspace |
spelling |
sg-smu-ink.sis_research-10121 (2024-08-01T14:38:39Z). Attack prompt generation for red teaming and defending large language models. DENG, Boyi; WANG, Wenjie; FENG, Fuli; DENG, Yang; WANG, Qifan; HE, Xiangnan. Published 2023-12-01T08:00:00Z. text, application/pdf. https://ink.library.smu.edu.sg/sis_research/9118 info:doi/10.18653/v1/2023.findings-emnlp.143 https://ink.library.smu.edu.sg/context/sis_research/article/10121/viewcontent/2023.findings_emnlp.143.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems. eng. Institutional Knowledge at Singapore Management University. Programming Languages and Compilers. |
institution |
Singapore Management University |
building |
SMU Libraries |
continent |
Asia |
country |
Singapore |
content_provider |
SMU Libraries |
collection |
InK@SMU |
language |
English |
topic |
Programming Languages and Compilers |
description |
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content. Previous research constructs attack prompts via manual or automatic methods, which have their own limitations on construction cost and quality. To address these issues, we propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts. Specifically, considering the impressive capabilities of newly emerged LLMs, we propose an attack framework to instruct LLMs to mimic human-generated prompts through in-context learning. Furthermore, we propose a defense framework that fine-tunes victim LLMs through iterative interactions with the attack framework to enhance their safety against red teaming attacks. Extensive experiments on different LLMs validate the effectiveness of our proposed attack and defense frameworks. Additionally, we release a series of attack prompts datasets named SAP with varying sizes, facilitating the safety evaluation and enhancement of more LLMs. |
format |
text |
author |
DENG, Boyi; WANG, Wenjie; FENG, Fuli; DENG, Yang; WANG, Qifan; HE, Xiangnan |
author_sort |
DENG, Boyi |
title |
Attack prompt generation for red teaming and defending large language models |
publisher |
Institutional Knowledge at Singapore Management University |
publishDate |
2023 |
url |
https://ink.library.smu.edu.sg/sis_research/9118 https://ink.library.smu.edu.sg/context/sis_research/article/10121/viewcontent/2023.findings_emnlp.143.pdf |