Topic recommendation for GitHub repositories: How far can extreme multi-label learning go?

GitHub is one of the most popular platforms for version control and collaboration. In GitHub, developers are able to assign related topics to their repositories, which is helpful for finding similar repositories. The topics that are assigned to repositories are varied and provide salient descriptions of...

Bibliographic Details
Main Authors: WIDYASARI, Ratnadira, ZHAO, Zhipeng, CONG, Thanh Le, KANG, Hong Jin, LO, David
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
Subjects:
Online Access: https://ink.library.smu.edu.sg/sis_research/8576
https://ink.library.smu.edu.sg/context/sis_research/article/9579/viewcontent/SANER23_GithubTopic.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-9579
record_format dspace
spelling sg-smu-ink.sis_research-9579 2024-01-25T08:58:17Z Topic recommendation for GitHub repositories: How far can extreme multi-label learning go? WIDYASARI, Ratnadira ZHAO, Zhipeng CONG, Thanh Le KANG, Hong Jin LO, David GitHub is one of the most popular platforms for version control and collaboration. In GitHub, developers are able to assign related topics to their repositories, which is helpful for finding similar repositories. The topics that are assigned to repositories are varied and provide salient descriptions of the repository; some topics describe the technology employed in a project, while others describe functionality of the project, its goals, and its features. Topics are part of the metadata of a repository and are useful for the organization and discoverability of the repository. However, the number of topics is large and this makes it challenging to assign a relevant set of topics to a repository. While prior studies filter out infrequently occurring topics before their experiments, we find that these topics form the majority of the data. In this study, we try to address the problem of identifying the topics from a GitHub repository by treating it as an extreme multi-label learning (XML) problem. We collect data of 21K GitHub repositories containing 37K labels of topics. The main challenge for XML is a large number of possible labels and severe data sparsity which fit the challenge of identification of topics from the GitHub repository. We evaluate multiple XML techniques, such as Parabel, Bonsai, LightXML, and ZestXML. We then perform an analysis of the different models proposed for XML classification. The best results on all the metrics from XML models are from ZestXML which is a combination of zero-shot and XML. We also compare the performance of ZestXML with a baseline from a recent study. The results show that ZestXML improves the baseline in terms of the average F1-score by 17.35%. We also find that for the repositories that have topics that rarely appear in the repositories used during training, ZestXML improves the performance greatly. The average of F1-score is 3 times higher as compared to the baseline for the topics with 20 or less occurrences in training data. 2023-03-01T08:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/8576 info:doi/10.1109/SANER56733.2023.00025 https://ink.library.smu.edu.sg/context/sis_research/article/9579/viewcontent/SANER23_GithubTopic.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Multi-label classification Extreme multi-label learning Topic recommendation GitHub repositories Artificial Intelligence and Robotics Databases and Information Systems
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Multi-label classification
Extreme multi-label learning
Topic recommendation
GitHub repositories
Artificial Intelligence and Robotics
Databases and Information Systems
spellingShingle Multi-label classification
Extreme multi-label learning
Topic recommendation
GitHub repositories
Artificial Intelligence and Robotics
Databases and Information Systems
WIDYASARI, Ratnadira
ZHAO, Zhipeng
CONG, Thanh Le
KANG, Hong Jin
LO, David
Topic recommendation for GitHub repositories: How far can extreme multi-label learning go?
description GitHub is one of the most popular platforms for version control and collaboration. In GitHub, developers are able to assign related topics to their repositories, which is helpful for finding similar repositories. The topics that are assigned to repositories are varied and provide salient descriptions of the repository; some topics describe the technology employed in a project, while others describe functionality of the project, its goals, and its features. Topics are part of the metadata of a repository and are useful for the organization and discoverability of the repository. However, the number of topics is large and this makes it challenging to assign a relevant set of topics to a repository. While prior studies filter out infrequently occurring topics before their experiments, we find that these topics form the majority of the data. In this study, we try to address the problem of identifying the topics from a GitHub repository by treating it as an extreme multi-label learning (XML) problem. We collect data of 21K GitHub repositories containing 37K labels of topics. The main challenge for XML is a large number of possible labels and severe data sparsity which fit the challenge of identification of topics from the GitHub repository. We evaluate multiple XML techniques, such as Parabel, Bonsai, LightXML, and ZestXML. We then perform an analysis of the different models proposed for XML classification. The best results on all the metrics from XML models are from ZestXML which is a combination of zero-shot and XML. We also compare the performance of ZestXML with a baseline from a recent study. The results show that ZestXML improves the baseline in terms of the average F1-score by 17.35%. We also find that for the repositories that have topics that rarely appear in the repositories used during training, ZestXML improves the performance greatly. The average of F1-score is 3 times higher as compared to the baseline for the topics with 20 or less occurrences in training data.
format text
author WIDYASARI, Ratnadira
ZHAO, Zhipeng
CONG, Thanh Le
KANG, Hong Jin
LO, David
author_facet WIDYASARI, Ratnadira
ZHAO, Zhipeng
CONG, Thanh Le
KANG, Hong Jin
LO, David
author_sort WIDYASARI, Ratnadira
title Topic recommendation for GitHub repositories: How far can extreme multi-label learning go?
title_short Topic recommendation for GitHub repositories: How far can extreme multi-label learning go?
title_full Topic recommendation for GitHub repositories: How far can extreme multi-label learning go?
title_fullStr Topic recommendation for GitHub repositories: How far can extreme multi-label learning go?
title_full_unstemmed Topic recommendation for GitHub repositories: How far can extreme multi-label learning go?
title_sort topic recommendation for github repositories: how far can extreme multi-label learning go?
publisher Institutional Knowledge at Singapore Management University
publishDate 2023
url https://ink.library.smu.edu.sg/sis_research/8576
https://ink.library.smu.edu.sg/context/sis_research/article/9579/viewcontent/SANER23_GithubTopic.pdf
_version_ 1789483279178530816