CoSec : On-the-Fly security hardening of code LLMs via supervised co-decoding

Large Language Models (LLMs) specialized in code have shown exceptional proficiency across various programming-related tasks, particularly code generation. Nonetheless, because they are pretrained on massive, uncritically filtered data, prior studies have shown that code LLMs are prone to generating code with potential vulnerabilities. Existing approaches to mitigate this risk involve crafting vulnerability-free data and subsequently retraining or fine-tuning the model. As the number of parameters exceeds a billion, the computation and data demands of these approaches become enormous. Moreover, an increasing number of code LLMs are distributed as services, where the internal representation is not accessible and the API is the only way to reach the LLM, making prior mitigation strategies inapplicable. To cope with this, we propose CoSec, an on-the-fly security hardening method for code LLMs based on security model-guided co-decoding, which reduces the likelihood that code LLMs generate code containing vulnerabilities. Our key idea is to train a separate but much smaller security model to co-decode with a target code LLM. Since the trained security model has higher confidence in secure tokens, it guides the generation of the target base model towards more secure code. By adjusting the probability distribution over tokens at each step of the decoding process, our approach effectively influences generation tendencies without accessing the internal parameters of the target code LLM. We have conducted extensive experiments across various parameter scales in multiple code LLMs (i.e., CodeGen, StarCoder, and DeepSeek-Coder), and the results show that our approach is effective for security hardening. Specifically, it improves the average security ratio of six base models by 5.02%-37.14%, while maintaining the functional correctness of the target model.
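The abstract describes adjusting the token probability distribution at each decoding step so that a small, separately trained security model steers the target code LLM toward secure tokens. The sketch below illustrates one way such a co-decoding loop could look; the blending rule (a confidence-weighted product), the `strength` knob, and the callables `base_next_probs` / `sec_next_probs` are illustrative assumptions, not the exact formulation from the paper.

```python
# Minimal co-decoding sketch, assuming both models expose only a
# "next-token probabilities" interface (e.g., via an API). All names here
# are hypothetical placeholders, not identifiers from the CoSec paper.
import numpy as np

def cosec_step(base_probs: np.ndarray, sec_probs: np.ndarray, strength: float = 0.5) -> np.ndarray:
    """Re-weight the base model's next-token distribution by the security
    model's confidence, then renormalise. Only the base model's output
    distribution is used; its internal parameters are never accessed."""
    adjusted = base_probs * (sec_probs ** strength)  # boost tokens the security model trusts
    return adjusted / adjusted.sum()

def co_decode(base_next_probs, sec_next_probs, prompt_tokens, max_new_tokens=128, strength=0.5):
    """Greedy co-decoding loop: at every step, blend the two distributions
    and pick the highest-probability token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        p_base = np.asarray(base_next_probs(tokens))  # black-box call to the target code LLM
        p_sec = np.asarray(sec_next_probs(tokens))    # call to the much smaller security model
        p = cosec_step(p_base, p_sec, strength)
        tokens.append(int(np.argmax(p)))              # sampling could be used instead of argmax
    return tokens
```

Because only the base model's output distribution over next tokens is consumed, such a loop also applies to API-only models whose internals are inaccessible, which is the deployment setting the abstract highlights.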


Bibliographic Details
Main Authors: LI, Dong; YAN, Meng; ZHANG, Yaosheng; LIU, Zhongxin; LIU, Chao; ZHANG, Xiaohong; CHEN, Ting; LO, David
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects:
Online Access:https://ink.library.smu.edu.sg/sis_research/9918
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-10918
record_format dspace
spelling sg-smu-ink.sis_research-10918
2025-01-02T08:03:58Z
CoSec : On-the-Fly security hardening of code LLMs via supervised co-decoding
LI, Dong; YAN, Meng; ZHANG, Yaosheng; LIU, Zhongxin; LIU, Chao; ZHANG, Xiaohong; CHEN, Ting; LO, David
Large Language Models (LLMs) specialized in code have shown exceptional proficiency across various programming-related tasks, particularly code generation. Nonetheless, because they are pretrained on massive, uncritically filtered data, prior studies have shown that code LLMs are prone to generating code with potential vulnerabilities. Existing approaches to mitigate this risk involve crafting vulnerability-free data and subsequently retraining or fine-tuning the model. As the number of parameters exceeds a billion, the computation and data demands of these approaches become enormous. Moreover, an increasing number of code LLMs are distributed as services, where the internal representation is not accessible and the API is the only way to reach the LLM, making prior mitigation strategies inapplicable. To cope with this, we propose CoSec, an on-the-fly security hardening method for code LLMs based on security model-guided co-decoding, which reduces the likelihood that code LLMs generate code containing vulnerabilities. Our key idea is to train a separate but much smaller security model to co-decode with a target code LLM. Since the trained security model has higher confidence in secure tokens, it guides the generation of the target base model towards more secure code. By adjusting the probability distribution over tokens at each step of the decoding process, our approach effectively influences generation tendencies without accessing the internal parameters of the target code LLM. We have conducted extensive experiments across various parameter scales in multiple code LLMs (i.e., CodeGen, StarCoder, and DeepSeek-Coder), and the results show that our approach is effective for security hardening. Specifically, it improves the average security ratio of six base models by 5.02%-37.14%, while maintaining the functional correctness of the target model.
2024-09-16T07:00:00Z
text
https://ink.library.smu.edu.sg/sis_research/9918
info:doi/10.1145/3650212.3680371
Research Collection School Of Computing and Information Systems
eng
Institutional Knowledge at Singapore Management University
Code generation; Large Language Models; Security hardening; Model training; Artificial Intelligence and Robotics; Software Engineering
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Code generation
Large Language Models
Security hardening
Model training
Artificial Intelligence and Robotics
Software Engineering
spellingShingle Code generation
Large Language Models
Security hardening
Model training
Artificial Intelligence and Robotics
Software Engineering
LI, Dong
YAN, Meng
ZHANG, Yaosheng
LIU, Zhongxin
LIU, Chao
ZHANG, Xiaohong
CHEN, Ting
LO, David
CoSec : On-the-Fly security hardening of code LLMs via supervised co-decoding
description Large Language Models (LLMs) specialized in code have shown exceptional proficiency across various programming-related tasks, particularly code generation. Nonetheless, because they are pretrained on massive, uncritically filtered data, prior studies have shown that code LLMs are prone to generating code with potential vulnerabilities. Existing approaches to mitigate this risk involve crafting vulnerability-free data and subsequently retraining or fine-tuning the model. As the number of parameters exceeds a billion, the computation and data demands of these approaches become enormous. Moreover, an increasing number of code LLMs are distributed as services, where the internal representation is not accessible and the API is the only way to reach the LLM, making prior mitigation strategies inapplicable. To cope with this, we propose CoSec, an on-the-fly security hardening method for code LLMs based on security model-guided co-decoding, which reduces the likelihood that code LLMs generate code containing vulnerabilities. Our key idea is to train a separate but much smaller security model to co-decode with a target code LLM. Since the trained security model has higher confidence in secure tokens, it guides the generation of the target base model towards more secure code. By adjusting the probability distribution over tokens at each step of the decoding process, our approach effectively influences generation tendencies without accessing the internal parameters of the target code LLM. We have conducted extensive experiments across various parameter scales in multiple code LLMs (i.e., CodeGen, StarCoder, and DeepSeek-Coder), and the results show that our approach is effective for security hardening. Specifically, it improves the average security ratio of six base models by 5.02%-37.14%, while maintaining the functional correctness of the target model.
format text
author LI, Dong
YAN, Meng
ZHANG, Yaosheng
LIU, Zhongxin
LIU, Chao
ZHANG, Xiaohong
CHEN, Ting
LO, David
author_facet LI, Dong
YAN, Meng
ZHANG, Yaosheng
LIU, Zhongxin
LIU, Chao
ZHANG, Xiaohong
CHEN, Ting
LO, David
author_sort LI, Dong
title CoSec : On-the-Fly security hardening of code LLMs via supervised co-decoding
title_short CoSec : On-the-Fly security hardening of code LLMs via supervised co-decoding
title_full CoSec : On-the-Fly security hardening of code LLMs via supervised co-decoding
title_fullStr CoSec : On-the-Fly security hardening of code LLMs via supervised co-decoding
title_full_unstemmed CoSec : On-the-Fly security hardening of code LLMs via supervised co-decoding
title_sort cosec : on-the-fly security hardening of code llms via supervised co-decoding
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
url https://ink.library.smu.edu.sg/sis_research/9918
_version_ 1821237285263245312