Defending large language models against jailbreak attacks via layer-specific editing
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Despite their impressive performance, recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts even when aligned via Reinforcement Learning from Human Feedback...
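The abstract in this record is truncated before it describes the defense itself. As a purely illustrative aside, the sketch below shows one generic way to intervene on a single transformer layer's hidden states at inference time using a PyTorch forward hook; the model ("gpt2"), the target layer index, and the rescaling edit are all placeholder assumptions, and this is not the method proposed in the paper, only a mechanical picture of what "editing a specific layer" can mean.

```python
# Purely illustrative: a generic layer-specific intervention on a
# transformer's hidden states via a PyTorch forward hook. Model name,
# layer index, and the edit itself are placeholder assumptions, not
# the paper's defense.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

TARGET_LAYER = 6  # hypothetical layer chosen for the edit
SCALE = 0.9       # hypothetical dampening factor

def edit_hidden_states(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden
    # state; returning a new tuple from the hook replaces the output.
    hidden = output[0]
    return (hidden * SCALE,) + output[1:]

handle = model.transformer.h[TARGET_LAYER].register_forward_hook(edit_hidden_states)

ids = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=20,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore the unedited model
```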
Main Authors: ZHAO, Wei; LI, Zhe; LI, Yige; SUN, Jun
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2024
Online Access: https://ink.library.smu.edu.sg/sis_research/9832
https://ink.library.smu.edu.sg/context/sis_research/article/10832/viewcontent/2024.findings_emnlp.293.pdf
Institution: Singapore Management University
Similar Items
- Defending against phishing attacks
  by: Tan, Justin Jui Kit
  Published: (2024)
- Attack prompt generation for red teaming and defending large language models
  by: DENG, Boyi, et al.
  Published: (2023)
- SybilGuard: Defending against sybil attacks via social networks
  by: Yu, H., et al.
  Published: (2013)
- Defending against cross-site scripting attacks
  by: Shar, Lwin Khin, et al.
  Published: (2013)
- Defending against redirect attacks in mobile IP
  by: DENG, Robert H., et al.
  Published: (2002)