Automated commit intelligence by pre-training

Bibliographic Details
Main Authors: Liu, Shangqing; Li, Yanzhou; Xie, Xiaofei; Ma, Wei; Meng, Guozhu; Liu, Yang
Other Authors: College of Computing and Data Science
Format: Article
Language: English
Published: 2025
Online Access: https://hdl.handle.net/10356/182466
Institution: Nanyang Technological University
Description
Summary: GitHub commits, which record code changes together with natural language messages describing them, play a critical role in software developers' comprehension of software evolution. Because of their importance in software development, several learning-based approaches have been proposed for tasks on GitHub commits, such as commit message generation and security patch identification. However, most existing works focus on customizing specialized neural networks for individual tasks. Inspired by pre-trained models for code, which have proven effective across different downstream tasks, and to promote the development of the open-source software community, we first collect a large-scale commit benchmark comprising over 7.99 million commits across 7 programming languages. Based on this benchmark, we present CommitBART, a pre-trained encoder-decoder Transformer model for GitHub commits. The model is pre-trained with six pre-training tasks in three categories (i.e., denoising objectives, cross-modal generation, and contrastive learning) to learn commit fragment representations. Our model is evaluated on one understanding task and three generation tasks for commits. Comprehensive experiments on these tasks demonstrate that CommitBART significantly outperforms previous pre-trained models for code. Further analysis also reveals that each pre-training task enhances model performance.