Unveiling memorization in code models

The availability of large-scale datasets, advanced architectures, and powerful computational resources has led to effective code models that automate diverse software engineering activities. The datasets usually consist of billions of lines of code from both open-source and private repositories. A...

Bibliographic Details
Main Authors:	YANG, Zhou, ZHAO, Zhipeng, WANG, Chenyu, SHI, Jieke, KIM, Dongsun, HAN, DongGyun, LO, David
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2024
Subjects:	Open-Source Software; Memorization; Code Generation; Programming Languages and Compilers; Software Engineering
Online Access:	https://ink.library.smu.edu.sg/sis_research/9246 https://ink.library.smu.edu.sg/context/sis_research/article/10246/viewcontent/3597503.3639074.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-10246
record_format	dspace
spelling	sg-smu-ink.sis_research-10246 2024-09-02T06:42:06Z
2024-04-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/9246 info:doi/10.1145/3597503.363907 https://ink.library.smu.edu.sg/context/sis_research/article/10246/viewcontent/3597503.3639074.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Open-Source Software Memorization Code Generation Programming Languages and Compilers Software Engineering
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Open-Source Software; Memorization; Code Generation; Programming Languages and Compilers; Software Engineering
description	The availability of large-scale datasets, advanced architectures, and powerful computational resources has led to effective code models that automate diverse software engineering activities. The datasets usually consist of billions of lines of code from both open-source and private repositories. A code model memorizes and produces source code verbatim, which potentially contains vulnerabilities, sensitive information, or code with strict licenses, leading to potential security and privacy issues. This paper investigates an important problem: to what extent do code models memorize their training data? We conduct an empirical study to explore memorization in large pre-trained code models. Our study highlights that simply extracting 20,000 outputs (each having 512 tokens) from a code model can produce over 40,125 code snippets that are memorized from the training data. To provide a better understanding, we build a taxonomy of memorized contents with 3 categories and 14 subcategories. The results show that the prompts sent to the code models affect the distribution of memorized contents. We identify several key factors of memorization. Specifically, given the same architecture, larger models suffer more from the memorization problem. A code model produces more memorization when it is allowed to generate longer outputs. We also find a strong positive correlation between the number of an output's occurrences in the training data and that in the generated outputs, which indicates that a potential way to reduce memorization is to remove duplicates in the training data. We then identify effective metrics that accurately infer whether an output contains memorization. We also make suggestions to deal with memorization.
format	text
author	YANG, Zhou ZHAO, Zhipeng WANG, Chenyu SHI, Jieke KIM, Dongsun HAN, DongGyun LO, David
author_sort	YANG, Zhou
title	Unveiling memorization in code models
publisher	Institutional Knowledge at Singapore Management University
publishDate	2024
url	https://ink.library.smu.edu.sg/sis_research/9246 https://ink.library.smu.edu.sg/context/sis_research/article/10246/viewcontent/3597503.3639074.pdf
_version_	1814047843564912640

Similar Items