Categorizing the content of GitHub README files

README files play an essential role in shaping a developer’s first impression of a software repository and in documenting the software project that the repository hosts. Yet, we lack a systematic understanding of the content of a typical README file as well as tools that can process these files auto...

Bibliographic Details
Main Authors:	PRANA, Gede Artha Azriadi; TREUDE, Christoph; THUNG, Ferdian; ATAPATTU, Thushari; LO, David
Format:	text
Language:	English
Published:	Institutional Knowledge at Singapore Management University 2018
Subjects:	Classification; GitHub README files; Documentation; Software Engineering
Online Access:	https://ink.library.smu.edu.sg/sis_research/4360 https://ink.library.smu.edu.sg/context/sis_research/article/5363/viewcontent/Github_readme_files_afv.pdf
Institution:	Singapore Management University

id	sg-smu-ink.sis_research-5363
record_format	dspace
spelling	sg-smu-ink.sis_research-5363 2019-06-13T09:57:05Z 2018-10-01T07:00:00Z text application/pdf info:doi/10.1007/s10664-018-9660-3 http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng
institution	Singapore Management University
building	SMU Libraries
continent	Asia
country	Singapore
content_provider	SMU Libraries
collection	InK@SMU
language	English
topic	Classification; GitHub README files; Documentation; Software Engineering
description	README files play an essential role in shaping a developer’s first impression of a software repository and in documenting the software project that the repository hosts. Yet, we lack a systematic understanding of the content of a typical README file as well as tools that can process these files automatically. To close this gap, we conduct a qualitative study involving the manual annotation of 4,226 README file sections from 393 randomly sampled GitHub repositories, and we design and evaluate a classifier and a set of features that can categorize these sections automatically. We find that information discussing the ‘What’ and ‘How’ of a repository is very common, while many README files lack information regarding the purpose and status of a repository. Our multi-label classifier, which can predict eight different categories, achieves an F1 score of 0.746. To evaluate the usefulness of the classification, we used the automatically determined classes to label sections in GitHub README files using badges and showed files with and without these badges to twenty software professionals. The majority of participants perceived the automated labeling of sections based on our classifier to ease information discovery. This work enables the owners of software repositories to improve the quality of their documentation, and it has the potential to make it easier for the software development community to discover relevant information in GitHub README files.
format	text
author	PRANA, Gede Artha Azriadi; TREUDE, Christoph; THUNG, Ferdian; ATAPATTU, Thushari; LO, David
title	Categorizing the content of GitHub README files
publisher	Institutional Knowledge at Singapore Management University
publishDate	2018

Similar Items