Defending large language models against jailbreak attacks via layer-specific editing

Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Despite their impressive performance, recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts even when aligned via Reinforcement Learning from Human Feedback...

Bibliographic Details
Main Authors:	ZHAO, Wei, LI, Zhe, LI, Yige, SUN, Jun
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2024
Subjects:	Software Engineering
Online Access:	https://ink.library.smu.edu.sg/sis_research/9832 https://ink.library.smu.edu.sg/context/sis_research/article/10832/viewcontent/2024.findings_emnlp.293.pdf
Institution:	Singapore Management University

Similar Items