A large scale study of long-time contributor prediction for GitHub projects

The continuous contributions made by long-time contributors (LTCs) are a key factor enabling open source software (OSS) projects to succeed and survive. We study GitHub as it has a large number of OSS projects and millions of contributors, which enables the study of the transition from newcom...

Bibliographic Details
Main Authors: BAO, Lingfeng, XIA, Xin, LO, David, MURPHY, Gail C.
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2021
Subjects:
Online Access: https://ink.library.smu.edu.sg/sis_research/4359
https://ink.library.smu.edu.sg/context/sis_research/article/5362/viewcontent/Long_time_contributor_Github_tse191.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-5362
record_format dspace
spelling sg-smu-ink.sis_research-5362 2021-11-16T06:38:43Z A large scale study of long-time contributor prediction for GitHub projects BAO, Lingfeng XIA, Xin LO, David MURPHY, Gail C. The continuous contributions made by long-time contributors (LTCs) are a key factor enabling open source software (OSS) projects to succeed and survive. We study GitHub as it has a large number of OSS projects and millions of contributors, which enables the study of the transition from newcomers to LTCs. In this paper, we investigate whether we can effectively predict which newcomers in OSS projects will become LTCs based on their activity data collected from GitHub. We collect GitHub data from GHTorrent, a mirror of GitHub data. We select the 917 most popular projects, which contain 75,046 contributors. We consider a developer to be an LTC of a project if the time interval between his/her first and last commit in the project is larger than a certain time T. In our experiment, we use three different settings for the time interval: 1, 2, and 3 years. There are 9,238, 3,968, and 1,577 contributors who become LTCs of a project in the three settings of the time interval, respectively. To build a prediction model, we extract many features from the activities of developers on GitHub, which we group into five dimensions: developer profile, repository profile, developer monthly activity, repository monthly activity, and collaboration network. We apply several classifiers, including naive Bayes, SVM, decision tree, kNN, and random forest. We find that the random forest classifier achieves the best performance, with AUCs of more than 0.75 in all three settings of the time interval for LTCs. We also investigate the most important features that differentiate newcomers who become LTCs from newcomers who stay in the projects for a short time. We find that the number of followers is the most important feature in all three settings of the time interval studied. 
We also find that the programming language and the average number of commits contributed by other developers when a newcomer joins a project are among the top 10 most important features in all three settings of the time interval for LTCs. Finally, we provide several implications for action based on our analysis results to help OSS projects retain newcomers. 2021-06-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/4359 info:doi/10.1109/TSE.2019.2918536 https://ink.library.smu.edu.sg/context/sis_research/article/5362/viewcontent/Long_time_contributor_Github_tse191.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Long Time Contributor GitHub Prediction Model Feature extraction Task analysis Numerical Analysis and Scientific Computing Software Engineering
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Long Time Contributor
GitHub
Prediction Model
Feature extraction
Task analysis
Numerical Analysis and Scientific Computing
Software Engineering
spellingShingle Long Time Contributor
GitHub
Prediction Model
Feature extraction
Task analysis
Numerical Analysis and Scientific Computing
Software Engineering
BAO, Lingfeng
XIA, Xin
LO, David
MURPHY, Gail C.
A large scale study of long-time contributor prediction for GitHub projects
description The continuous contributions made by long-time contributors (LTCs) are a key factor enabling open source software (OSS) projects to succeed and survive. We study GitHub as it has a large number of OSS projects and millions of contributors, which enables the study of the transition from newcomers to LTCs. In this paper, we investigate whether we can effectively predict which newcomers in OSS projects will become LTCs based on their activity data collected from GitHub. We collect GitHub data from GHTorrent, a mirror of GitHub data. We select the 917 most popular projects, which contain 75,046 contributors. We consider a developer to be an LTC of a project if the time interval between his/her first and last commit in the project is larger than a certain time T. In our experiment, we use three different settings for the time interval: 1, 2, and 3 years. There are 9,238, 3,968, and 1,577 contributors who become LTCs of a project in the three settings of the time interval, respectively. To build a prediction model, we extract many features from the activities of developers on GitHub, which we group into five dimensions: developer profile, repository profile, developer monthly activity, repository monthly activity, and collaboration network. We apply several classifiers, including naive Bayes, SVM, decision tree, kNN, and random forest. We find that the random forest classifier achieves the best performance, with AUCs of more than 0.75 in all three settings of the time interval for LTCs. We also investigate the most important features that differentiate newcomers who become LTCs from newcomers who stay in the projects for a short time. We find that the number of followers is the most important feature in all three settings of the time interval studied. 
We also find that the programming language and the average number of commits contributed by other developers when a newcomer joins a project are among the top 10 most important features in all three settings of the time interval for LTCs. Finally, we provide several implications for action based on our analysis results to help OSS projects retain newcomers.
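The LTC definition in the description above (first-to-last commit interval larger than a threshold of 1, 2, or 3 years) can be illustrated with a minimal sketch. This is not the authors' code: the function name `label_ltcs`, the `(developer, project, timestamp)` tuple layout, and the 365-day approximation of a year are all assumptions made for illustration only.

```python
from datetime import datetime, timedelta


def label_ltcs(commits, years):
    """Label each (developer, project) pair as LTC or not.

    commits: iterable of (developer, project, timestamp) tuples
             (a hypothetical layout, not the GHTorrent schema).
    years:   threshold in years; approximated here as 365 days each.
    """
    threshold = timedelta(days=365 * years)
    first_last = {}  # (developer, project) -> (first commit, last commit)
    for developer, project, ts in commits:
        key = (developer, project)
        if key in first_last:
            first, last = first_last[key]
            first_last[key] = (min(first, ts), max(last, ts))
        else:
            first_last[key] = (ts, ts)
    # A pair is an LTC when the first-to-last interval exceeds the threshold.
    return {
        key: (last - first) > threshold
        for key, (first, last) in first_last.items()
    }


# Illustrative, made-up commit records.
example_commits = [
    ("alice", "proj", datetime(2015, 1, 1)),
    ("alice", "proj", datetime(2016, 6, 1)),
    ("bob", "proj", datetime(2015, 3, 1)),
]
labels_1y = label_ltcs(example_commits, years=1)
labels_2y = label_ltcs(example_commits, years=2)
```

Running the labeling once per threshold mirrors the three settings of the time interval described above; the resulting boolean labels would serve as the target variable for a classifier.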
format text
author BAO, Lingfeng
XIA, Xin
LO, David
MURPHY, Gail C.
author_facet BAO, Lingfeng
XIA, Xin
LO, David
MURPHY, Gail C.
author_sort BAO, Lingfeng
title A large scale study of long-time contributor prediction for GitHub projects
title_short A large scale study of long-time contributor prediction for GitHub projects
title_full A large scale study of long-time contributor prediction for GitHub projects
title_fullStr A large scale study of long-time contributor prediction for GitHub projects
title_full_unstemmed A large scale study of long-time contributor prediction for GitHub projects
title_sort large scale study of long-time contributor prediction for github projects
publisher Institutional Knowledge at Singapore Management University
publishDate 2021
url https://ink.library.smu.edu.sg/sis_research/4359
https://ink.library.smu.edu.sg/context/sis_research/article/5362/viewcontent/Long_time_contributor_Github_tse191.pdf
_version_ 1770574686036426752