Attack prompt generation for red teaming and defending large language models

Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content. Previous research constructs attack prompts via manual or automatic methods, which have their own limitations on construction cost and quality. To address these issues, we propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts. Specifically, considering the impressive capabilities of newly emerged LLMs, we propose an attack framework to instruct LLMs to mimic human-generated prompts through in-context learning. Furthermore, we propose a defense framework that fine-tunes victim LLMs through iterative interactions with the attack framework to enhance their safety against red teaming attacks. Extensive experiments on different LLMs validate the effectiveness of our proposed attack and defense frameworks. Additionally, we release a series of attack prompts datasets named SAP with varying sizes, facilitating the safety evaluation and enhancement of more LLMs.
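The attack framework described in the abstract can be sketched, very loosely, as an in-context-learning loop: a few manually written attack prompts are used as demonstrations, and an LLM is asked to imitate them for new topics. The sketch below mocks the LLM call and uses entirely hypothetical names; it is not the paper's released code.

```python
# Hypothetical sketch of in-context-learning attack-prompt generation.
# The LLM is mocked so the example is self-contained; all function and
# variable names are illustrative, not from the paper's implementation.

def build_icl_prompt(seed_prompts, topic):
    """Assemble an in-context-learning query from manual seed prompts."""
    examples = "\n\n".join(
        f"Example {i + 1}:\n{p}" for i, p in enumerate(seed_prompts)
    )
    return (
        "You are red-teaming a language model. "
        "Here are sample attack prompts:\n\n"
        f"{examples}\n\n"
        f"Write a new attack prompt about: {topic}"
    )

def generate_attack_prompts(llm, seed_prompts, topics):
    """One round of the attack loop: query the LLM once per topic."""
    return [llm(build_icl_prompt(seed_prompts, t)) for t in topics]

# Mock LLM: just reports how many demonstrations it was shown.
mock_llm = lambda prompt: (
    f"[mock attack prompt derived from {prompt.count('Example')} examples]"
)

outputs = generate_attack_prompts(mock_llm, ["seed A", "seed B"], ["topic X"])
print(outputs[0])
```

In the paper's defense framework, prompts produced by such a loop would then be used to fine-tune the victim LLM iteratively; that half is omitted here since it requires a training setup.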


Saved in:
Bibliographic Details
Main Authors: DENG, Boyi, WANG, Wenjie, FENG, Fuli, DENG, Yang, WANG, Qifan, HE, Xiangnan
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
Subjects: Programming Languages and Compilers
Online Access:https://ink.library.smu.edu.sg/sis_research/9118
https://ink.library.smu.edu.sg/context/sis_research/article/10121/viewcontent/2023.findings_emnlp.143.pdf
Institution: Singapore Management University
DOI: 10.18653/v1/2023.findings-emnlp.143
License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Collection: Research Collection School Of Computing and Information Systems, InK@SMU (SMU Libraries)
Published online: 2023-12-01