Federated learning for software engineering: A case study of code clone detection and defect prediction

In various research domains, artificial intelligence (AI) has gained significant prominence, leading to the development of numerous learning-based models in research laboratories, which are evaluated using benchmark datasets. While the models proposed in previous studies may demonstrate satisfactory...

Bibliographic Details
Main Authors:	YANG, Yanming; HU, Xing; GAO, Zhipeng; CHEN, Jinfu; NI, Chao; XIA, Xin; LO, David
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2024
Subjects:	Benchmark testing; Cloning; Code clone detection; Codes; Data models; Defect prediction; Federated learning; Parameter aggregation strategy; Skewed data distribution; Task analysis; Training; Numerical Analysis and Scientific Computing; Software Engineering
Online Access:	https://ink.library.smu.edu.sg/sis_research/8632
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-9635
record_format	dspace
spelling	sg-smu-ink.sis_research-9635 2024-01-25T06:30:03Z 2024-01-01T08:00:00Z text https://ink.library.smu.edu.sg/sis_research/8632 info:doi/10.1109/TSE.2023.3347898 Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Benchmark testing; Cloning; Code clone detection; Codes; Data models; Defect prediction; Federated learning; Parameter aggregation strategy; Skewed data distribution; Task analysis; Training; Numerical Analysis and Scientific Computing; Software Engineering
description	In various research domains, artificial intelligence (AI) has gained significant prominence, leading to the development of numerous learning-based models in research laboratories, which are evaluated using benchmark datasets. While the models proposed in previous studies may demonstrate satisfactory performance on benchmark datasets, translating academic findings into practical applications for industry practitioners presents challenges. This can entail either the direct adoption of trained academic models into industrial applications, leading to a performance decrease, or retraining models with industrial data, a task often hindered by insufficient data instances or skewed data distributions. Real-world industrial data is typically significantly more intricate than benchmark datasets, frequently exhibiting data-skewing issues, such as label distribution skews and quantity skews. Furthermore, accessing industrial data, particularly source code, can prove challenging for Software Engineering (SE) researchers due to privacy policies. This limitation hinders SE researchers’ ability to gain insights into industry developers’ concerns and subsequently enhance their proposed models. To bridge the divide between academic models and industrial applications, we introduce a federated learning (FL)-based framework called Almity. Our aim is to simplify the process of implementing research findings into practical use for both SE researchers and industry developers. Almity enhances model performance on sensitive skewed data distributions while ensuring data privacy and security. It introduces an innovative aggregation strategy that takes into account three key attributes: data scale, data balance, and minority class learnability. This strategy is employed to refine model parameters, thereby enhancing model performance on sensitive skewed datasets. In our evaluation, we employ two well-established SE tasks, i.e., code clone detection and defect prediction, as evaluation tasks. We compare the performance of Almity on both machine learning (ML) and deep learning (DL) models against two mainstream training methods, specifically the Centralized Training Method (CTM) and Vanilla Federated Learning (VFL), to validate the effectiveness and generalizability of Almity. Our experimental results demonstrate that our framework is not only feasible but also practical in real-world scenarios. Almity consistently enhances the performance of learning-based models, outperforming baseline training methods across all types of data distributions.
format	text
author	YANG, Yanming; HU, Xing; GAO, Zhipeng; CHEN, Jinfu; NI, Chao; XIA, Xin; LO, David
author_sort	YANG, Yanming
title	Federated learning for software engineering: A case study of code clone detection and defect prediction
publisher	Institutional Knowledge at Singapore Management University
publishDate	2024
url	https://ink.library.smu.edu.sg/sis_research/8632
_version_	1789483295829917696
