Retrieval based code summarisation using code pre-trained models
Automatic code summarization has emerged as a valuable tool for enhancing software development speed and comprehension. In the context of source code summarization, where the goal is to generate concise and meaningful natural language summaries for given code snippets, pre-trained models have shown significant promise.
Main Author: | Gupta, Sahaj
---|---
Other Authors: | Liu Yang
Format: | Final Year Project
Language: | English
Published: | Nanyang Technological University, 2024
Subjects: | Computer and Information Science; Code pre-trained models
Online Access: | https://hdl.handle.net/10356/175679
Institution: | Nanyang Technological University
Language: | English
id
sg-ntu-dr.10356-175679
record_format
dspace
spelling
sg-ntu-dr.10356-175679 2024-05-03T15:38:42Z
Retrieval based code summarisation using code pre-trained models
Gupta, Sahaj
Liu Yang
School of Computer Science and Engineering
yangliu@ntu.edu.sg
Computer and Information Science
Code pre-trained models
Automatic code summarization has emerged as a valuable tool for enhancing software development speed and comprehension. In the context of source code summarization, where the goal is to generate concise and meaningful natural language summaries for given code snippets, pre-trained models have shown significant promise. This task involves understanding the semantics of code and generating human-readable descriptions, making it particularly challenging. Attention mechanisms, inherent in many pre-trained models, have been shown to be particularly useful for source code summarization. They allow models to focus on relevant parts of the code when generating summaries, improving the coherence and informativeness of the generated text. This project delves into the effectiveness of different code pre-trained language models for generating concise and informative summaries of source code. We focus on comparing our proposed retrieval-based framework against state-of-the-art models and baselines in CodeXGLUE. We fine-tune and evaluate CodeBERT, an encoder-only model pre-trained on massive code repositories, using our proposed method. We evaluate the performance of these models on benchmark datasets, considering metrics like BLEU-4 and perplexity to assess the quality of generated summaries compared to human references. Further, we also perform a token-level analysis of the input data. We observe performance comparable to the baseline models, with improvements of 0.81, 0.13, and 0.24 in BLEU-4 score for the JavaScript, Java, and Go programming languages respectively, highlighting the effectiveness of our approach.
Bachelor's degree
2024-05-03T01:47:18Z 2024-05-03T01:47:18Z 2024
Final Year Project (FYP)
Gupta, S. (2024). Retrieval based code summarisation using code pre-trained models. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/175679
https://hdl.handle.net/10356/175679
en
application/pdf
Nanyang Technological University
institution
Nanyang Technological University
building
NTU Library
continent
Asia
country
Singapore
content_provider
NTU Library
collection
DR-NTU
language
English
topic
Computer and Information Science
Code pre-trained models
spellingShingle
Computer and Information Science
Code pre-trained models
Gupta, Sahaj
Retrieval based code summarisation using code pre-trained models
description
Automatic code summarization has emerged as a valuable tool for enhancing software development speed and comprehension. In the context of source code summarization, where the goal is to generate concise and meaningful natural language summaries for given code snippets, pre-trained models have shown significant promise. This task involves understanding the semantics of code and generating human-readable descriptions, making it particularly challenging. Attention mechanisms, inherent in many pre-trained models, have been shown to be particularly useful for source code summarization. They allow models to focus on relevant parts of the code when generating summaries, improving the coherence and informativeness of the generated text.
This project delves into the effectiveness of different code pre-trained language models for generating concise and informative summaries of source code.
We focus on comparing our proposed retrieval-based framework against state-of-the-art models and baselines in CodeXGLUE. We fine-tune and evaluate CodeBERT, an encoder-only model pre-trained on massive code repositories, using our proposed method. We evaluate the performance of these models on benchmark datasets, considering metrics like BLEU-4 and perplexity to assess the quality of generated summaries compared to human references. Further, we also perform a token-level analysis of the input data.
We observe performance comparable to the baseline models, with improvements of 0.81, 0.13, and 0.24 in BLEU-4 score for the JavaScript, Java, and Go programming languages respectively, highlighting the effectiveness of our approach.
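As an illustration of the ideas in the abstract above, the following is a minimal, hypothetical sketch of retrieval-augmented input construction and smoothed sentence-level BLEU-4 scoring. The TF-IDF retriever, the `</s>` separator format, and the NLTK-based BLEU computation are assumptions for illustration only; the project's actual retrieval framework, CodeBERT fine-tuning pipeline, and the CodeXGLUE scoring script may differ.

```python
# Hypothetical sketch: retrieve the most similar training example and prepend
# it to the query snippet, then score a generated summary with smoothed BLEU-4.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy (code, summary) pairs standing in for the training split of the dataset.
train_code = ["def add(a, b): return a + b", "def is_even(n): return n % 2 == 0"]
train_summ = ["add two numbers", "check whether a number is even"]

# Index the training code with a simple TF-IDF representation over code tokens.
vectorizer = TfidfVectorizer(token_pattern=r"\w+")
train_vecs = vectorizer.fit_transform(train_code)

def retrieve_augmented_input(query_code: str) -> str:
    """Prepend the most similar retrieved exemplar (code and its summary)
    so the summarization model can condition on it alongside the query."""
    sims = cosine_similarity(vectorizer.transform([query_code]), train_vecs)[0]
    best = int(sims.argmax())
    return f"{train_code[best]} </s> {train_summ[best]} </s> {query_code}"

def bleu4(reference: str, hypothesis: str) -> float:
    """Smoothed BLEU-4 between a generated summary and the human reference."""
    smooth = SmoothingFunction().method4
    return sentence_bleu([reference.split()], hypothesis.split(),
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=smooth)

if __name__ == "__main__":
    print(retrieve_augmented_input("def mul(x, y): return x * y"))
    print(round(bleu4("multiply two numbers", "multiplies two numbers"), 4))
```

In the project itself, the augmented input would be fed to a fine-tuned encoder-decoder built on CodeBERT rather than printed, and corpus-level BLEU-4 would be reported per language as in CodeXGLUE.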
author2
Liu Yang
author_facet
Liu Yang
Gupta, Sahaj
format
Final Year Project
author
Gupta, Sahaj
author_sort
Gupta, Sahaj
title
Retrieval based code summarisation using code pre-trained models
title_short
Retrieval based code summarisation using code pre-trained models
title_full
Retrieval based code summarisation using code pre-trained models
title_fullStr
Retrieval based code summarisation using code pre-trained models
title_full_unstemmed
Retrieval based code summarisation using code pre-trained models
title_sort
retrieval based code summarisation using code pre-trained models
publisher
Nanyang Technological University
publishDate
2024
url
https://hdl.handle.net/10356/175679
_version_
1814047021663780864