Attack prompt generation for red teaming and defending large language models

Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content. Previous research constructs attack prompts via manual or automatic methods, which have their own limitations on construction cost and quality. To address these issues, we propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts. Specifically, considering the impressive capabilities of newly emerged LLMs, we propose an attack framework to instruct LLMs to mimic human-generated prompts through in-context learning. Furthermore, we propose a defense framework that fine-tunes victim LLMs through iterative interactions with the attack framework to enhance their safety against red teaming attacks. Extensive experiments on different LLMs validate the effectiveness of our proposed attack and defense frameworks. Additionally, we release a series of attack prompts datasets named SAP with varying sizes, facilitating the safety evaluation and enhancement of more LLMs.
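The attack framework described in the abstract can be sketched, very loosely, as an in-context-learning loop: a few manually written attack prompts are used as demonstrations, and an LLM is asked to imitate them for new topics. The sketch below mocks the LLM call and uses entirely hypothetical names; it is not the paper's released code.

```python
# Hypothetical sketch of in-context-learning attack-prompt generation.
# The LLM is mocked so the example is self-contained; all function and
# variable names are illustrative, not from the paper's implementation.

def build_icl_prompt(seed_prompts, topic):
    """Assemble an in-context-learning query from manual seed prompts."""
    examples = "\n\n".join(
        f"Example {i + 1}:\n{p}" for i, p in enumerate(seed_prompts)
    )
    return (
        "You are red-teaming a language model. "
        "Here are sample attack prompts:\n\n"
        f"{examples}\n\n"
        f"Write a new attack prompt about: {topic}"
    )

def generate_attack_prompts(llm, seed_prompts, topics):
    """One round of the attack loop: query the LLM once per topic."""
    return [llm(build_icl_prompt(seed_prompts, t)) for t in topics]

# Mock LLM: just reports how many demonstrations it was shown.
mock_llm = lambda prompt: (
    f"[mock attack prompt derived from {prompt.count('Example')} examples]"
)

outputs = generate_attack_prompts(mock_llm, ["seed A", "seed B"], ["topic X"])
print(outputs[0])
```

In the paper's defense framework, prompts produced by such a loop would then be used to fine-tune the victim LLM iteratively; that half is omitted here since it requires a training setup.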


Saved in:
Bibliographic Details
Main Authors: DENG, Boyi, WANG, Wenjie, FENG, Fuli, DENG, Yang, WANG, Qifan, HE, Xiangnan
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
Subjects: Programming Languages and Compilers
Online Access:https://ink.library.smu.edu.sg/sis_research/9118
https://ink.library.smu.edu.sg/context/sis_research/article/10121/viewcontent/2023.findings_emnlp.143.pdf
Institution: Singapore Management University
DOI: 10.18653/v1/2023.findings-emnlp.143
License: CC BY-NC-ND 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Collection: Research Collection School Of Computing and Information Systems, InK@SMU (SMU Libraries)
Published online: 2023-12-01