Automated commit intelligence by pre-training

GitHub commits, which pair code changes with natural-language messages describing them, play a critical role in helping developers understand software evolution. Because of this importance, several learning-based approaches have been proposed for GitHub commits, such as comm...

Bibliographic Details
Main Authors:	Liu, Shangqing; Li, Yanzhou; Xie, Xiaofei; Ma, Wei; Meng, Guozhu; Liu, Yang
Other Authors:	College of Computing and Data Science
Format:	Article
Language:	English
Published:	2025
Subjects:	Computer and Information Science; GitHub commit; Code pre-training model
Online Access:	https://hdl.handle.net/10356/182466
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-182466
record_format	dspace
last_modified	2025-02-04T01:35:44Z
date_available	2025-02-04T01:35:43Z
sponsors	Cyber Security Agency; National Research Foundation (NRF)
funding	This research/project is supported by the National Research Foundation (NRF), Singapore, and the Cyber Security Agency under its National Cybersecurity R&D Programme (NCRP25-P04-TAICeN), the NRF, Singapore, and DSO National Laboratories under the AI Singapore Programme (AISG Award No: AISG2-GC-2023-008) and NRF Investigatorship (NRF-NRFI06-2020-0001).
grant_numbers	NCRP25-P04-TAICeN; AISG2-GC-2023-008; NRF-NRFI06-2020-0001
type	Journal Article
citation	Liu, S., Li, Y., Xie, X., Ma, W., Meng, G. & Liu, Y. (2024). Automated commit intelligence by pre-training. ACM Transactions on Software Engineering and Methodology, 33(8), 3674731. https://dx.doi.org/10.1145/3674731
journal	ACM Transactions on Software Engineering and Methodology
volume	33
issue	8
article_number	3674731
issn	1049-331X
doi	10.1145/3674731
scopus_id	2-s2.0-85212983481
handle	https://hdl.handle.net/10356/182466
rights	© 2024 the Owner/Author(s). Publication rights licensed to ACM. All rights reserved.
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Computer and Information Science; GitHub commit; Code pre-training model
description	GitHub commits, which pair code changes with natural-language messages describing them, play a critical role in helping developers understand software evolution. Because of this importance, several learning-based approaches have been proposed for commit-related tasks such as commit message generation and security patch identification. However, most existing work customizes a specialized neural network for each task. Inspired by code pre-trained models, which have proven effective across a range of downstream tasks, and to promote the development of the open-source software community, we first collect a large-scale commit benchmark of over 7.99 million commits across 7 programming languages. Based on this benchmark, we present CommitBART, a pre-trained encoder–decoder Transformer model for GitHub commits. The model is pre-trained with six tasks in three categories (denoising objectives, cross-modal generation, and contrastive learning) to learn representations of commit fragments. We evaluate the model on one understanding task and three generation tasks for commits. Comprehensive experiments on these tasks show that CommitBART significantly outperforms previous pre-trained models for code. Further analysis shows that each pre-training task improves model performance.
author2	College of Computing and Data Science
author_facet	College of Computing and Data Science; Liu, Shangqing; Li, Yanzhou; Xie, Xiaofei; Ma, Wei; Meng, Guozhu; Liu, Yang
format	Article
author	Liu, Shangqing; Li, Yanzhou; Xie, Xiaofei; Ma, Wei; Meng, Guozhu; Liu, Yang
author_sort	Liu, Shangqing
title	Automated commit intelligence by pre-training
publishDate	2025
url	https://hdl.handle.net/10356/182466
_version_	1823807360693436416

Similar Items