Automated commit intelligence by pre-training
GitHub commits, which record code changes together with natural-language messages describing them, play a critical role in software developers’ comprehension of software evolution. Due to their importance in software development, several learning-based approaches have been proposed for GitHub commits, such as comm...
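Main Authors: | Liu, Shangqing; Li, Yanzhou; Xie, Xiaofei; Ma, Wei; Meng, Guozhu; Liu, Yang |
---|---|
Other Authors: | College of Computing and Data Science |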
Format: | Article |
Language: | English |
Published: | 2025 |
Subjects: | Computer and Information Science; GitHub commit; Code pre-training model |
Online Access: | https://hdl.handle.net/10356/182466 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-182466 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-182466; 2025-02-04T01:35:44Z; Automated commit intelligence by pre-training; Liu, Shangqing; Li, Yanzhou; Xie, Xiaofei; Ma, Wei; Meng, Guozhu; Liu, Yang; College of Computing and Data Science; Computer and Information Science; GitHub commit; Code pre-training model. GitHub commits, which record code changes together with natural-language messages describing them, play a critical role in software developers’ comprehension of software evolution. Due to their importance in software development, several learning-based approaches have been proposed for GitHub commits, such as commit message generation and security patch identification. However, most existing works focus on customizing specialized neural networks for different tasks. Inspired by pre-trained code models, which have proven effective across different downstream tasks, and to promote the development of the open-source software community, we first collect a large-scale commit benchmark including over 7.99 million commits across 7 programming languages. Based on this benchmark, we present CommitBART, a pre-trained encoder–decoder Transformer model for GitHub commits. The model is pre-trained with six pre-training tasks in three categories (i.e., denoising objectives, cross-modal generation, and contrastive learning) to learn commit fragment representations. Our model is evaluated on one understanding task and three generation tasks for commits. The comprehensive experiments on these tasks demonstrate that CommitBART significantly outperforms previous pre-trained models for code. Further analysis also reveals that each pre-training task improves model performance.
Cyber Security Agency; National Research Foundation (NRF). This research/project is supported by the National Research Foundation (NRF), Singapore, and the Cyber Security Agency under its National Cybersecurity R & D Programme (NCRP25-P04-TAICeN), the NRF, Singapore, and DSO National Laboratories under the AI Singapore Programme (AISG Award No: AISG2-GC-2023-008) and NRF Investigatorship (NRF-NRFI06-2020-0001). 2025-02-04T01:35:43Z; 2025-02-04T01:35:43Z; 2024; Journal Article; Liu, S., Li, Y., Xie, X., Ma, W., Meng, G. & Liu, Y. (2024). Automated commit intelligence by pre-training. ACM Transactions on Software Engineering and Methodology, 33(8), 3674731-. https://dx.doi.org/10.1145/3674731; 1049-331X; https://hdl.handle.net/10356/182466; 10.1145/3674731; 2-s2.0-85212983481; 8; 33; 3674731; en; NCRP25-P04-TAICeN; AISG2-GC-2023-008; NRF-NRFI06-2020-0001; ACM Transactions on Software Engineering and Methodology; © 2024 the Owner/Author(s). Publication rights licensed to ACM. All rights reserved. |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Computer and Information Science; GitHub commit; Code pre-training model |
spellingShingle |
Computer and Information Science; GitHub commit; Code pre-training model; Liu, Shangqing; Li, Yanzhou; Xie, Xiaofei; Ma, Wei; Meng, Guozhu; Liu, Yang; Automated commit intelligence by pre-training |
description |
GitHub commits, which record code changes together with natural-language messages describing them, play a critical role in software developers’ comprehension of software evolution. Due to their importance in software development, several learning-based approaches have been proposed for GitHub commits, such as commit message generation and security patch identification. However, most existing works focus on customizing specialized neural networks for different tasks. Inspired by pre-trained code models, which have proven effective across different downstream tasks, and to promote the development of the open-source software community, we first collect a large-scale commit benchmark including over 7.99 million commits across 7 programming languages. Based on this benchmark, we present CommitBART, a pre-trained encoder–decoder Transformer model for GitHub commits. The model is pre-trained with six pre-training tasks in three categories (i.e., denoising objectives, cross-modal generation, and contrastive learning) to learn commit fragment representations. Our model is evaluated on one understanding task and three generation tasks for commits. The comprehensive experiments on these tasks demonstrate that CommitBART significantly outperforms previous pre-trained models for code. Further analysis also reveals that each pre-training task improves model performance. |
author2 |
College of Computing and Data Science |
author_facet |
College of Computing and Data Science; Liu, Shangqing; Li, Yanzhou; Xie, Xiaofei; Ma, Wei; Meng, Guozhu; Liu, Yang |
format |
Article |
author |
Liu, Shangqing; Li, Yanzhou; Xie, Xiaofei; Ma, Wei; Meng, Guozhu; Liu, Yang |
author_sort |
Liu, Shangqing |
title |
Automated commit intelligence by pre-training |
publishDate |
2025 |
url |
https://hdl.handle.net/10356/182466 |
_version_ |
1823807360693436416 |