Defending large language models against jailbreak attacks via layer-specific editing
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Despite their impressive performance, recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts even when aligned via Reinforcement Learning from Human Feedback...
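The abstract in this record is truncated before it describes the defense itself. As a purely illustrative aside, the sketch below shows one generic way to intervene on a single transformer layer's hidden states at inference time using a PyTorch forward hook; the model ("gpt2"), the target layer index, and the rescaling edit are all placeholder assumptions, and this is not the method proposed in the paper, only a mechanical picture of what "editing a specific layer" can mean.

```python
# Purely illustrative: a generic layer-specific intervention on a
# transformer's hidden states via a PyTorch forward hook. Model name,
# layer index, and the edit itself are placeholder assumptions, not
# the paper's defense.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

TARGET_LAYER = 6  # hypothetical layer chosen for the edit
SCALE = 0.9       # hypothetical dampening factor

def edit_hidden_states(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden
    # state; returning a new tuple from the hook replaces the output.
    hidden = output[0]
    return (hidden * SCALE,) + output[1:]

handle = model.transformer.h[TARGET_LAYER].register_forward_hook(edit_hidden_states)

ids = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=20,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore the unedited model
```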
Main Authors: ZHAO, Wei; LI, Zhe; LI, Yige; SUN, Jun
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University, 2024
Online Access: https://ink.library.smu.edu.sg/sis_research/9832
https://ink.library.smu.edu.sg/context/sis_research/article/10832/viewcontent/2024.findings_emnlp.293.pdf
Institution: Singapore Management University
Similar Items
- Defending against phishing attacks
  by: Tan, Justin Jui Kit
  Published: (2024)
- Attack prompt generation for red teaming and defending large language models
  by: DENG, Boyi, et al.
  Published: (2023)
- SybilGuard: Defending against sybil attacks via social networks
  by: Yu, H., et al.
  Published: (2013)
- Defending against cross-site scripting attacks
  by: Shar, Lwin Khin, et al.
  Published: (2013)
- Defending against redirect attacks in mobile IP
  by: DENG, Robert H., et al.
  Published: (2002)