Towards general conceptual model editing via adversarial representation engineering
Since the rapid development of Large Language Models (LLMs) has achieved remarkable success, understanding and rectifying their internal complex mechanisms has become an urgent issue. Recent research has attempted to interpret their behaviors through the lens of inner representation. However, develo...
Main Authors: | ZHANG, Yihao; WEI, Zeming; SUN, Jun; SUN, Meng |
---|---|
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University 2024 |
Subjects: | Software Engineering |
Online Access: | https://ink.library.smu.edu.sg/sis_research/9833 https://ink.library.smu.edu.sg/context/sis_research/article/10833/viewcontent/2404.13752v3.pdf |
Institution: | Singapore Management University |
id |
sg-smu-ink.sis_research-10833 |
---|---|
record_format |
dspace |
spelling |
sg-smu-ink.sis_research-10833 2024-12-24T03:33:46Z Towards general conceptual model editing via adversarial representation engineering ZHANG, Yihao WEI, Zeming SUN, Jun SUN, Meng Since the rapid development of Large Language Models (LLMs) has achieved remarkable success, understanding and rectifying their internal complex mechanisms has become an urgent issue. Recent research has attempted to interpret their behaviors through the lens of inner representation. However, developing practical and efficient methods for applying these representations for general and flexible model editing remains challenging. In this work, we explore how to leverage insights from representation engineering to guide the editing of LLMs by deploying a representation sensor as an editing oracle. We first identify the importance of a robust and reliable sensor during editing, then propose an Adversarial Representation Engineering (ARE) framework to provide a unified and interpretable approach for conceptual model editing without compromising baseline performance. Experiments on multiple tasks demonstrate the effectiveness of ARE in various model editing scenarios. Our code and data are available at https://github.com/Zhang-Yihao/Adversarial-Representation-Engineering. 2024-12-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/9833 info:doi/10.48550/arXiv.2404.13752 https://ink.library.smu.edu.sg/context/sis_research/article/10833/viewcontent/2404.13752v3.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Software Engineering |
institution |
Singapore Management University |
building |
SMU Libraries |
continent |
Asia |
country |
Singapore |
content_provider |
SMU Libraries |
collection |
InK@SMU |
language |
English |
topic |
Software Engineering |
description |
Since the rapid development of Large Language Models (LLMs) has achieved remarkable success, understanding and rectifying their internal complex mechanisms has become an urgent issue. Recent research has attempted to interpret their behaviors through the lens of inner representation. However, developing practical and efficient methods for applying these representations for general and flexible model editing remains challenging. In this work, we explore how to leverage insights from representation engineering to guide the editing of LLMs by deploying a representation sensor as an editing oracle. We first identify the importance of a robust and reliable sensor during editing, then propose an Adversarial Representation Engineering (ARE) framework to provide a unified and interpretable approach for conceptual model editing without compromising baseline performance. Experiments on multiple tasks demonstrate the effectiveness of ARE in various model editing scenarios. Our code and data are available at https://github.com/Zhang-Yihao/Adversarial-Representation-Engineering. |
format |
text |
author |
ZHANG, Yihao WEI, Zeming SUN, Jun SUN, Meng |
author_sort |
ZHANG, Yihao |
title |
Towards general conceptual model editing via adversarial representation engineering |
publisher |
Institutional Knowledge at Singapore Management University |
publishDate |
2024 |
url |
https://ink.library.smu.edu.sg/sis_research/9833 https://ink.library.smu.edu.sg/context/sis_research/article/10833/viewcontent/2404.13752v3.pdf |
_version_ |
1821237243628486656 |