An empirical study to evaluate AIGC detectors on code content

Artificial Intelligence Generated Content (AIGC) has garnered considerable attention for its impressive performance, with Large Language Models (LLMs) such as ChatGPT emerging as leading AIGC models that produce high-quality responses across various applications, including software development and maintenance. Despite its potential, the misuse of LLMs, especially in security and safety-critical domains such as academic integrity and answering questions on Stack Overflow, poses significant concerns. Numerous AIGC detectors have been developed and evaluated on natural language data; however, their performance on code-related content generated by LLMs remains unexplored. To fill this gap, we present an empirical study evaluating existing AIGC detectors in the software domain. We selected three state-of-the-art LLMs, i.e., GPT-3.5, WizardCoder, and CodeLlama, for machine-content generation. We further created a comprehensive dataset of 2.23M samples of code-related content for each model, encompassing popular software activities such as Q&A (150K), code summarization (1M), and code generation (1.1M). We evaluated thirteen AIGC detectors, comprising six commercial and seven open-source solutions, and assessed their performance on this dataset. Our results indicate that AIGC detectors perform less effectively on code-related data than on natural language data. Fine-tuning can enhance detector performance, especially for content within the same domain, but generalization remains a challenge.

Bibliographic Details
Main Authors: WANG, Jian, LIU, Shangqing, XIE, Xiaofei, LI, Yi
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2024
Subjects: AIGC Detection, Code Generation, Large Language Model, Artificial Intelligence and Robotics
Online Access: https://ink.library.smu.edu.sg/sis_research/9724
https://ink.library.smu.edu.sg/context/sis_research/article/10724/viewcontent/3691620.3695468.pdf
Institution: Singapore Management University
Language: English
id sg-smu-ink.sis_research-10724
record_format dspace
spelling sg-smu-ink.sis_research-10724 2024-12-16T06:58:01Z An empirical study to evaluate AIGC detectors on code content WANG, Jian LIU, Shangqing XIE, Xiaofei LI, Yi Artificial Intelligence Generated Content (AIGC) has garnered considerable attention for its impressive performance, with Large Language Models (LLMs) such as ChatGPT emerging as leading AIGC models that produce high-quality responses across various applications, including software development and maintenance. Despite its potential, the misuse of LLMs, especially in security and safety-critical domains such as academic integrity and answering questions on Stack Overflow, poses significant concerns. Numerous AIGC detectors have been developed and evaluated on natural language data; however, their performance on code-related content generated by LLMs remains unexplored. To fill this gap, we present an empirical study evaluating existing AIGC detectors in the software domain. We selected three state-of-the-art LLMs, i.e., GPT-3.5, WizardCoder, and CodeLlama, for machine-content generation. We further created a comprehensive dataset of 2.23M samples of code-related content for each model, encompassing popular software activities such as Q&A (150K), code summarization (1M), and code generation (1.1M). We evaluated thirteen AIGC detectors, comprising six commercial and seven open-source solutions, and assessed their performance on this dataset. Our results indicate that AIGC detectors perform less effectively on code-related data than on natural language data. Fine-tuning can enhance detector performance, especially for content within the same domain, but generalization remains a challenge. 2024-10-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/9724 info:doi/10.1145/3691620.3695468 https://ink.library.smu.edu.sg/context/sis_research/article/10724/viewcontent/3691620.3695468.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University AIGC Detection Code Generation Large Language Model Artificial Intelligence and Robotics
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic AIGC Detection
Code Generation
Large Language Model
Artificial Intelligence and Robotics
spellingShingle AIGC Detection
Code Generation
Large Language Model
Artificial Intelligence and Robotics
WANG, Jian
LIU, Shangqing
XIE, Xiaofei
LI, Yi
An empirical study to evaluate AIGC detectors on code content
description Artificial Intelligence Generated Content (AIGC) has garnered considerable attention for its impressive performance, with Large Language Models (LLMs) such as ChatGPT emerging as leading AIGC models that produce high-quality responses across various applications, including software development and maintenance. Despite its potential, the misuse of LLMs, especially in security and safety-critical domains such as academic integrity and answering questions on Stack Overflow, poses significant concerns. Numerous AIGC detectors have been developed and evaluated on natural language data; however, their performance on code-related content generated by LLMs remains unexplored. To fill this gap, we present an empirical study evaluating existing AIGC detectors in the software domain. We selected three state-of-the-art LLMs, i.e., GPT-3.5, WizardCoder, and CodeLlama, for machine-content generation. We further created a comprehensive dataset of 2.23M samples of code-related content for each model, encompassing popular software activities such as Q&A (150K), code summarization (1M), and code generation (1.1M). We evaluated thirteen AIGC detectors, comprising six commercial and seven open-source solutions, and assessed their performance on this dataset. Our results indicate that AIGC detectors perform less effectively on code-related data than on natural language data. Fine-tuning can enhance detector performance, especially for content within the same domain, but generalization remains a challenge.
format text
author WANG, Jian
LIU, Shangqing
XIE, Xiaofei
LI, Yi
author_facet WANG, Jian
LIU, Shangqing
XIE, Xiaofei
LI, Yi
author_sort WANG, Jian
title An empirical study to evaluate AIGC detectors on code content
title_short An empirical study to evaluate AIGC detectors on code content
title_full An empirical study to evaluate AIGC detectors on code content
title_fullStr An empirical study to evaluate AIGC detectors on code content
title_full_unstemmed An empirical study to evaluate AIGC detectors on code content
title_sort empirical study to evaluate aigc detectors on code content
publisher Institutional Knowledge at Singapore Management University
publishDate 2024
url https://ink.library.smu.edu.sg/sis_research/9724
https://ink.library.smu.edu.sg/context/sis_research/article/10724/viewcontent/3691620.3695468.pdf
_version_ 1819113119785418752