Towards general conceptual model editing via adversarial representation engineering

Since the rapid development of Large Language Models (LLMs) has achieved remarkable success, understanding and rectifying their internal complex mechanisms has become an urgent issue. Recent research has attempted to interpret their behaviors through the lens of inner representation. However, develo...

Bibliographic Details
Main Authors: ZHANG, Yihao, WEI, Zeming, SUN, Jun, SUN, Meng
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects:
Online Access: https://ink.library.smu.edu.sg/sis_research/9833
https://ink.library.smu.edu.sg/context/sis_research/article/10833/viewcontent/2404.13752v3.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-10833
record_format dspace
spelling sg-smu-ink.sis_research-108332024-12-24T03:33:46Z Towards general conceptual model editing via adversarial representation engineering ZHANG, Yihao WEI, Zeming SUN, Jun SUN, Meng Since the rapid development of Large Language Models (LLMs) has achieved remarkable success, understanding and rectifying their internal complex mechanisms has become an urgent issue. Recent research has attempted to interpret their behaviors through the lens of inner representation. However, developing practical and efficient methods for applying these representations for general and flexible model editing remains challenging. In this work, we explore how to leverage insights from representation engineering to guide the editing of LLMs by deploying a representation sensor as an editing oracle. We first identify the importance of a robust and reliable sensor during editing, then propose an Adversarial Representation Engineering (ARE) framework to provide a unified and interpretable approach for conceptual model editing without compromising baseline performance. Experiments on multiple tasks demonstrate the effectiveness of ARE in various model editing scenarios. Our code and data are available at https://github.com/Zhang-Yihao/Adversarial-Representation-Engineering. 2024-12-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/9833 info:doi/10.48550/arXiv.2404.13752 https://ink.library.smu.edu.sg/context/sis_research/article/10833/viewcontent/2404.13752v3.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Software Engineering
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Software Engineering
spellingShingle Software Engineering
ZHANG, Yihao
WEI, Zeming
SUN, Jun
SUN, Meng
Towards general conceptual model editing via adversarial representation engineering
description Since the rapid development of Large Language Models (LLMs) has achieved remarkable success, understanding and rectifying their internal complex mechanisms has become an urgent issue. Recent research has attempted to interpret their behaviors through the lens of inner representation. However, developing practical and efficient methods for applying these representations for general and flexible model editing remains challenging. In this work, we explore how to leverage insights from representation engineering to guide the editing of LLMs by deploying a representation sensor as an editing oracle. We first identify the importance of a robust and reliable sensor during editing, then propose an Adversarial Representation Engineering (ARE) framework to provide a unified and interpretable approach for conceptual model editing without compromising baseline performance. Experiments on multiple tasks demonstrate the effectiveness of ARE in various model editing scenarios. Our code and data are available at https://github.com/Zhang-Yihao/Adversarial-Representation-Engineering.
format text
author ZHANG, Yihao
WEI, Zeming
SUN, Jun
SUN, Meng
author_facet ZHANG, Yihao
WEI, Zeming
SUN, Jun
SUN, Meng
author_sort ZHANG, Yihao
title Towards general conceptual model editing via adversarial representation engineering
title_short Towards general conceptual model editing via adversarial representation engineering
title_full Towards general conceptual model editing via adversarial representation engineering
title_fullStr Towards general conceptual model editing via adversarial representation engineering
title_full_unstemmed Towards general conceptual model editing via adversarial representation engineering
title_sort towards general conceptual model editing via adversarial representation engineering
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
url https://ink.library.smu.edu.sg/sis_research/9833
https://ink.library.smu.edu.sg/context/sis_research/article/10833/viewcontent/2404.13752v3.pdf
_version_ 1821237243628486656