Towards general conceptual model editing via adversarial representation engineering
As Large Language Models (LLMs) have developed rapidly and achieved remarkable success, understanding and rectifying their complex internal mechanisms has become an urgent issue. Recent research has attempted to interpret their behaviors through the lens of inner representation. However, develo...
Main Authors: | ZHANG, Yihao, WEI, Zeming, SUN, Jun, SUN, Meng |
---|---|
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2024 |
Subjects: | |
Online Access: | https://ink.library.smu.edu.sg/sis_research/9833 https://ink.library.smu.edu.sg/context/sis_research/article/10833/viewcontent/2404.13752v3.pdf |
Institution: | Singapore Management University |
Similar Items
- Defending large language models against jailbreak attacks via layer-specific editing
  by: ZHAO, Wei, et al.
  Published: (2024)
- Towards characterizing adversarial defects of deep learning software from the lens of uncertainty
  by: ZHANG, Xiyue, et al.
  Published: (2020)
- Towards superior control in automatic face editing with generative adversarial networks
  by: Zhang, Xijue
  Published: (2022)
- White-box fairness testing through adversarial sampling
  by: ZHANG, Peixin, et al.
  Published: (2020)
- Attack as detection: Using adversarial attack methods to detect abnormal examples
  by: ZHAO, Zhe, et al.
  Published: (2024)