Gotcha! This model uses my code! Evaluating membership leakage risks in code models

Leveraging large-scale datasets from open-source projects and advances in large language models, recent progress has led to sophisticated code models for key software engineering tasks, such as program repair and code completion. These models are trained on data from various sources, including public open-source projects like GitHub and private, confidential code from companies, raising significant privacy concerns.
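To make the attack idea concrete, the sketch below is a minimal, hypothetical illustration of a membership inference attack that uses the three signals named in the abstract: model input, model output, and ground truth. It is not the authors' Gotcha implementation; it assumes scikit-learn is available, and every name in it (Example, similarity, train_attack, infer_membership) is an illustrative choice rather than anything taken from the paper.

from dataclasses import dataclass
from difflib import SequenceMatcher

from sklearn.linear_model import LogisticRegression  # assumed dependency


@dataclass
class Example:
    model_input: str    # code context fed to the victim model
    model_output: str   # completion produced by the victim model
    ground_truth: str   # the true continuation of the code
    is_member: int = 0  # 1 if known to be in the training set (shadow data only)


def similarity(a: str, b: str) -> float:
    # Character-level similarity in [0, 1]; a crude stand-in for the richer
    # output/ground-truth comparison a real attack would use.
    return SequenceMatcher(None, a, b).ratio()


def features(ex: Example) -> list[float]:
    # Intuition: on memorized (member) examples, the model's output tends to
    # match the ground truth more closely than on unseen examples.
    return [
        similarity(ex.model_output, ex.ground_truth),
        similarity(ex.model_input, ex.model_output),
        len(ex.model_output) / max(len(ex.ground_truth), 1),
    ]


def train_attack(shadow_examples: list[Example]) -> LogisticRegression:
    # Fit the attack classifier on examples whose membership the attacker
    # already knows (e.g., data used to train a shadow/surrogate model).
    X = [features(ex) for ex in shadow_examples]
    y = [ex.is_member for ex in shadow_examples]
    return LogisticRegression().fit(X, y)


def infer_membership(attack: LogisticRegression, ex: Example) -> float:
    # Probability that `ex` was part of the victim model's training data.
    return attack.predict_proba([features(ex)])[0][1]

A real attack in this setting would replace the crude character-level similarity with stronger comparisons between the model's output and the ground truth, and would calibrate the classifier on shadow data matched to the victim model.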


Bibliographic Details
Main Authors: YANG, Zhou, ZHAO, Zhipeng, WANG, Chenyu, SHI, Jieke, KIM, Dongsun, HAN, Donggyun, LO, David
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects:
Online Access:https://ink.library.smu.edu.sg/sis_research/9889
https://ink.library.smu.edu.sg/context/sis_research/article/10889/viewcontent/2310.01166v2.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-10889
record_format dspace
spelling sg-smu-ink.sis_research-108892025-01-02T09:08:43Z Gotcha! This model uses my code! Evaluating membership leakage risks in code models YANG, Zhou ZHAO, Zhipeng WANG, Chenyu SHI, Jieke KIM, Dongsun HAN, Donggyun LO, David Leveraging large-scale datasets from open-source projects and advances in large language models, recent progress has led to sophisticated code models for key software engineering tasks, such as program repair and code completion. These models are trained on data from various sources, including public open-source projects like GitHub and private, confidential code from companies, raising significant privacy concerns. This paper investigates a crucial but unexplored question: What is the risk of membership information leakage in code models? Membership leakage refers to the vulnerability where an attacker can infer whether a specific data point was part of the training dataset. We present Gotcha, a novel membership inference attack method designed for code models, and evaluate its effectiveness on Java-based datasets. Gotcha simultaneously considers three key factors: model input, model output, and ground truth. Our ablation study confirms that each factor significantly enhances attack performance. Our investigation reveals a troubling finding: membership leakage risk is significantly elevated. While previous methods had accuracy close to random guessing, Gotcha achieves high precision, with a true positive rate of 0.95 and a low false positive rate of 0.10. We also demonstrate that the attacker's knowledge of the victim model (e.g., model architecture and pre-training data) affects attack success. Additionally, modifying decoding strategies can help reduce membership leakage risks. This research highlights the urgent need to better understand the privacy vulnerabilities of code models and develop strong countermeasures against these threats. 2024-12-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/9889 info:doi/10.1109/TSE.2024.3482719 https://ink.library.smu.edu.sg/context/sis_research/article/10889/viewcontent/2310.01166v2.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Membership inference attack Privacy Large Language Models for code Code completion Information Security Numerical Analysis and Scientific Computing
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Membership inference attack
Privacy
Large Language Models for code
Code completion
Information Security
Numerical Analysis and Scientific Computing
spellingShingle Membership inference attack
Privacy
Large Language Models for code
Code completion
Information Security
Numerical Analysis and Scientific Computing
YANG, Zhou
ZHAO, Zhipeng
WANG, Chenyu
SHI, Jieke
KIM, Dongsun
HAN, Donggyun
LO, David
Gotcha! This model uses my code! Evaluating membership leakage risks in code models
description Leveraging large-scale datasets from open-source projects and advances in large language models, recent progress has led to sophisticated code models for key software engineering tasks, such as program repair and code completion. These models are trained on data from various sources, including public open-source projects like GitHub and private, confidential code from companies, raising significant privacy concerns. This paper investigates a crucial but unexplored question: What is the risk of membership information leakage in code models? Membership leakage refers to the vulnerability where an attacker can infer whether a specific data point was part of the training dataset. We present Gotcha, a novel membership inference attack method designed for code models, and evaluate its effectiveness on Java-based datasets. Gotcha simultaneously considers three key factors: model input, model output, and ground truth. Our ablation study confirms that each factor significantly enhances attack performance. Our investigation reveals a troubling finding: membership leakage risk is significantly elevated. While previous methods had accuracy close to random guessing, Gotcha achieves high precision, with a true positive rate of 0.95 and a low false positive rate of 0.10. We also demonstrate that the attacker's knowledge of the victim model (e.g., model architecture and pre-training data) affects attack success. Additionally, modifying decoding strategies can help reduce membership leakage risks. This research highlights the urgent need to better understand the privacy vulnerabilities of code models and develop strong countermeasures against these threats.
format text
author YANG, Zhou
ZHAO, Zhipeng
WANG, Chenyu
SHI, Jieke
KIM, Dongsun
HAN, Donggyun
LO, David
author_facet YANG, Zhou
ZHAO, Zhipeng
WANG, Chenyu
SHI, Jieke
KIM, Dongsun
HAN, Donggyun
LO, David
author_sort YANG, Zhou
title Gotcha! This model uses my code! Evaluating membership leakage risks in code models
title_short Gotcha! This model uses my code! Evaluating membership leakage risks in code models
title_full Gotcha! This model uses my code! Evaluating membership leakage risks in code models
title_fullStr Gotcha! This model uses my code! Evaluating membership leakage risks in code models
title_full_unstemmed Gotcha! This model uses my code! Evaluating membership leakage risks in code models
title_sort gotcha! this model uses my code! evaluating membership leakage risks in code models
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
url https://ink.library.smu.edu.sg/sis_research/9889
https://ink.library.smu.edu.sg/context/sis_research/article/10889/viewcontent/2310.01166v2.pdf
_version_ 1821237275635220480