Topic recommendation for GitHub repositories: How far can extreme multi-label learning go?

Bibliographic Details
Main Authors: WIDYASARI, Ratnadira, ZHAO, Zhipeng, CONG, Thanh Le, KANG, Hong Jin, LO, David
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
Online Access: https://ink.library.smu.edu.sg/sis_research/8576
https://ink.library.smu.edu.sg/context/sis_research/article/9579/viewcontent/SANER23_GithubTopic.pdf
Institution: Singapore Management University
Description
Summary: GitHub is one of the most popular platforms for version control and collaboration. On GitHub, developers can assign related topics to their repositories, which helps others find similar repositories. The topics assigned to repositories are varied and provide salient descriptions of the repository; some topics describe the technology employed in a project, while others describe the functionality of the project, its goals, and its features. Topics are part of the metadata of a repository and are useful for the organization and discoverability of the repository. However, the number of topics is large, which makes it challenging to assign a relevant set of topics to a repository. While prior studies filter out infrequently occurring topics before their experiments, we find that these topics form the majority of the data. In this study, we address the problem of identifying the topics of a GitHub repository by treating it as an extreme multi-label learning (XML) problem. We collect data from 21K GitHub repositories containing 37K topic labels. The main challenges for XML are a large number of possible labels and severe data sparsity, which match the challenges of identifying topics for a GitHub repository. We evaluate multiple XML techniques, such as Parabel, Bonsai, LightXML, and ZestXML. We then perform an analysis of the different models proposed for XML classification. Among the XML models, ZestXML, which combines zero-shot learning and XML, achieves the best results on all metrics. We also compare the performance of ZestXML with a baseline from a recent study. The results show that ZestXML improves on the baseline in terms of the average F1-score by 17.35%. We also find that for repositories whose topics rarely appear in the repositories used during training, ZestXML improves the performance greatly. The average F1-score is 3 times higher than that of the baseline for topics with 20 or fewer occurrences in the training data.
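
The summary reports an average F1-score and singles out topics with 20 or fewer occurrences in the training data, but it does not say how the score is averaged. The following is a minimal sketch under one plausible reading: F1 is computed per repository over sets of topics and then averaged across repositories. All function names are hypothetical and are not taken from the paper or from any XML library.

```python
from collections import Counter


def f1_per_repo(true_topics, predicted_topics):
    """Set-based F1 between a repository's true and predicted topics."""
    true_set, pred_set = set(true_topics), set(predicted_topics)
    hits = len(true_set & pred_set)
    if hits == 0:
        # Covers empty inputs and the no-overlap case.
        return 0.0
    precision = hits / len(pred_set)
    recall = hits / len(true_set)
    return 2 * precision * recall / (precision + recall)


def average_f1(truths, predictions):
    """Mean of per-repository F1 (assumed averaging scheme)."""
    scores = [f1_per_repo(t, p) for t, p in zip(truths, predictions)]
    return sum(scores) / len(scores) if scores else 0.0


def rare_topics(training_topic_lists, max_occurrences=20):
    """Topics assigned to at most `max_occurrences` training repositories."""
    counts = Counter(
        topic for topics in training_topic_lists for topic in set(topics)
    )
    return {topic for topic, n in counts.items() if n <= max_occurrences}
```

To look at rare topics separately, one could restrict the evaluation to repositories that carry at least one topic returned by `rare_topics` and call `average_f1` on that subset; whether this matches the paper's exact protocol is an assumption.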