Automated commit intelligence by pre-training

GitHub commits, which record code changes together with natural-language messages describing them, play a critical role in helping software developers understand software evolution. Because of this importance, several learning-based approaches have been proposed for GitHub commits, such as comm...

Bibliographic Details
Main Authors: Liu, Shangqing, Li, Yanzhou, Xie, Xiaofei, Ma, Wei, Meng, Guozhu, Liu, Yang
Other Authors: College of Computing and Data Science
Format: Article
Language: English
Published: 2025
Subjects: Computer and Information Science; GitHub commit; Code pre-training model
Online Access: https://hdl.handle.net/10356/182466
Institution: Nanyang Technological University
id sg-ntu-dr.10356-182466
record_format dspace
spelling sg-ntu-dr.10356-182466 2025-02-04T01:35:44Z
Title: Automated commit intelligence by pre-training
Authors: Liu, Shangqing; Li, Yanzhou; Xie, Xiaofei; Ma, Wei; Meng, Guozhu; Liu, Yang
Affiliation: College of Computing and Data Science
Subjects: Computer and Information Science; GitHub commit; Code pre-training model
Funding agencies: Cyber Security Agency; National Research Foundation (NRF)
Funding statement: This research/project is supported by the National Research Foundation (NRF), Singapore, and the Cyber Security Agency under its National Cybersecurity R&D Programme (NCRP25-P04-TAICeN), the NRF, Singapore, and DSO National Laboratories under the AI Singapore Programme (AISG Award No: AISG2-GC-2023-008) and NRF Investigatorship (NRF-NRFI06-2020-0001).
Grant numbers: NCRP25-P04-TAICeN; AISG2-GC-2023-008; NRF-NRFI06-2020-0001
Date accessioned: 2025-02-04T01:35:43Z
Date available: 2025-02-04T01:35:43Z
Date issued: 2024
Type: Journal Article
Citation: Liu, S., Li, Y., Xie, X., Ma, W., Meng, G. & Liu, Y. (2024). Automated commit intelligence by pre-training. ACM Transactions on Software Engineering and Methodology, 33(8), 3674731. https://dx.doi.org/10.1145/3674731
Journal: ACM Transactions on Software Engineering and Methodology
Volume: 33
Issue: 8
Article number: 3674731
ISSN: 1049-331X
Handle: https://hdl.handle.net/10356/182466
DOI: 10.1145/3674731
Scopus ID: 2-s2.0-85212983481
Language: en
Rights: © 2024 the Owner/Author(s). Publication rights licensed to ACM. All rights reserved.
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Computer and Information Science
GitHub commit
Code pre-training model
description GitHub commits, which record code changes together with natural-language messages describing them, play a critical role in helping software developers understand software evolution. Because of this importance, several learning-based approaches have been proposed for GitHub commits, such as commit message generation and security patch identification. However, most existing work focuses on customizing specialized neural networks for individual tasks. Inspired by code pre-trained models, which have proven effective across different downstream tasks, and to promote the development of the open-source software community, we first collect a large-scale commit benchmark comprising over 7.99 million commits across 7 programming languages. Based on this benchmark, we present CommitBART, a pre-trained encoder–decoder Transformer model for GitHub commits. The model is pre-trained on six tasks in three categories (i.e., denoising objectives, cross-modal generation, and contrastive learning) to learn commit fragment representations. We evaluate the model on one understanding task and three generation tasks for commits. Comprehensive experiments on these tasks demonstrate that CommitBART significantly outperforms previous pre-trained models for code. Further analysis also reveals that each pre-training task enhances model performance.
author2 College of Computing and Data Science
format Article
author Liu, Shangqing
Li, Yanzhou
Xie, Xiaofei
Ma, Wei
Meng, Guozhu
Liu, Yang
author_sort Liu, Shangqing
title Automated commit intelligence by pre-training