Correlating automated and human evaluation of code documentation generation quality

Automatic code documentation generation has been a crucial task in the field of software engineering. It not only relieves developers from writing code documentation but also helps them to understand programs better. Specifically, deep-learning-based techniques that leverage large-scale source code corpora have been widely used in code documentation generation. These works tend to use automatic metrics (such as BLEU, METEOR, ROUGE, CIDEr, and SPICE) to evaluate different models. These metrics compare generated documentation to reference texts by measuring the overlapping words. Unfortunately, there is no evidence demonstrating the correlation between these metrics and human judgment. We conduct experiments on two popular code documentation generation tasks, code comment generation and commit message generation, to investigate the presence or absence of correlations between these metrics and human judgments. For each task, we replicate three state-of-the-art approaches and the generated documentation is evaluated automatically in terms of BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. We also ask 24 participants to rate the generated documentation considering three aspects (i.e., language, content, and effectiveness). Each participant is given Java methods or commit diffs along with the target documentation to be rated. The results show that the ranking of generated documentation from automatic metrics is different from that evaluated by human annotators. Thus, these automatic metrics are not reliable enough to replace human evaluation for code documentation generation tasks. In addition, METEOR shows the strongest correlation (with moderate Pearson correlation r about 0.7) to human evaluation metrics. However, it is still much lower than the correlation observed between different annotators (with a high Pearson correlation r about 0.8) and correlations that are reported in the literature for other tasks (e.g., Neural Machine Translation [39]). Our study points to the need to develop specialized automated evaluation metrics that can correlate more closely to human evaluation metrics for code generation tasks.
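The analysis the abstract describes can be illustrated with a minimal sketch: score each generated documentation string against its reference with an overlap-based metric (sentence-level BLEU via NLTK is used here as one example of the metric family), then correlate those scores with human ratings using Pearson's r (SciPy). This is not the authors' pipeline; the example pairs, ratings, and metric choice are hypothetical stand-ins for the paper's setup of five metrics and 24 annotators.

```python
# Illustrative sketch only: one overlap-based metric (sentence-level BLEU)
# correlated against hypothetical human ratings with Pearson's r.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import pearsonr

# Hypothetical (reference, generated) documentation pairs.
references = [
    "returns the maximum value in the given list",
    "parses the configuration file and returns a map of settings",
    "closes the underlying database connection",
]
generated = [
    "return the max value of a list",
    "reads the config file",
    "close the connection to the database",
]
# Hypothetical human ratings on a 1-5 scale (e.g., averaged "content" scores).
human_ratings = [4.0, 2.5, 4.5]

# Sentence-level BLEU with smoothing, computed per sample.
smooth = SmoothingFunction().method1
bleu_scores = [
    sentence_bleu([ref.split()], hyp.split(), smoothing_function=smooth)
    for ref, hyp in zip(references, generated)
]

# Pearson's r between automatic scores and human ratings. The paper reports
# r ~ 0.7 for METEOR vs. humans, below the ~0.8 observed between annotators;
# a real analysis uses far more samples than this toy example.
r, p_value = pearsonr(bleu_scores, human_ratings)
print(f"per-sample BLEU: {[round(s, 3) for s in bleu_scores]}")
print(f"Pearson r = {r:.2f} (p = {p_value:.2f})")
```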

Bibliographic Details
Main Authors: HU, Xing, CHEN, Qiuyuan, WANG, Haoye, XIA, Xin, LO, David, ZIMMERMANN, Thomas
Format: text (application/pdf)
Language: English
Published: Institutional Knowledge at Singapore Management University, 2022
DOI: 10.1145/3502853
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Collection: Research Collection School Of Computing and Information Systems, InK@SMU (SMU Libraries)
Subjects: Code Documentation Generation; Evaluation Metrics; Empirical Study; Databases and Information Systems; Software Engineering
Online Access: https://ink.library.smu.edu.sg/sis_research/7664
https://ink.library.smu.edu.sg/context/sis_research/article/8667/viewcontent/tosem218.pdf
Institution: Singapore Management University