Detecting similar repositories on GitHub

GitHub contains millions of repositories, many of which are similar to one another (i.e., having similar source code or implementing similar functionality). Finding similar repositories on GitHub can help software engineers reuse source code, build prototypes, id...

Bibliographic Details
Main Authors: ZHANG, Yun; LO, David; KOCHHAR, Pavneet Singh; XIA, Xin; LI, Quanlai; SUN, Jianling
Format: text
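Language: English
Published: Institutional Knowledge at Singapore Management University 2017
Subjects: Recommendation System; Similar Repositories; GitHub; Information Retrieval; search engines; Databases and Information Systems; Software Engineering; Systems Architecture
Online Access: https://ink.library.smu.edu.sg/sis_research/3615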
https://ink.library.smu.edu.sg/context/sis_research/article/4616/viewcontent/Detecting_Similar_Repositories_on_GitHub_2017_SANER.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-4616
record_format dspace
spelling sg-smu-ink.sis_research-4616; 2018-12-07T03:16:57Z; Detecting similar repositories on GitHub; 2017-02-01T08:00:00Z; text; application/pdf; https://ink.library.smu.edu.sg/sis_research/3615; info:doi/10.1109/SANER.2017.7884605; https://ink.library.smu.edu.sg/context/sis_research/article/4616/viewcontent/Detecting_Similar_Repositories_on_GitHub_2017_SANER.pdf; http://creativecommons.org/licenses/by-nc-nd/4.0/; Research Collection School Of Computing and Information Systems; eng; Institutional Knowledge at Singapore Management University
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Recommendation System
Similar Repositories
GitHub
Information Retrieval
search engines
Databases and Information Systems
Software Engineering
Systems Architecture
description GitHub contains millions of repositories, many of which are similar to one another (i.e., having similar source code or implementing similar functionality). Finding similar repositories on GitHub can help software engineers reuse source code, build prototypes, identify alternative implementations, explore related projects, find projects to contribute to, and discover code theft and plagiarism. Previous studies have proposed techniques to detect similar applications by analyzing API usage patterns and software tags. However, these prior studies either make use of only a limited source of information or use information that is not available for projects on GitHub. In this paper, we propose a novel approach that can effectively detect similar repositories on GitHub. Our approach is based on three heuristics that leverage two data sources (i.e., GitHub stars and readme files) not considered in previous work. The three heuristics are: repositories whose readme files contain similar content are likely to be similar to one another, repositories starred by users with similar interests are likely to be similar, and repositories starred together within a short period of time by the same user are likely to be similar. Based on these three heuristics, we compute three relevance scores (i.e., readme-based relevance, stargazer-based relevance, and time-based relevance) to assess the similarity between two repositories. By integrating the three relevance scores, we build a recommendation system called RepoPal to detect similar repositories. We compare RepoPal to a prior state-of-the-art approach, CLAN, using one thousand Java repositories on GitHub. Our empirical evaluation demonstrates that RepoPal achieves a higher success rate, precision, and confidence than CLAN.
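The three relevance scores named in the description can be illustrated with a small sketch. This record does not give the formulas used by RepoPal, so everything below is an assumption made for illustration: cosine similarity over raw term counts stands in for readme-based relevance, Jaccard overlap of stargazers stands in for stargazer-based relevance, an inverse time gap stands in for time-based relevance, and a plain average stands in for the integration step. All function names and the input shape (a dict mapping user name to star time in Unix seconds) are hypothetical.

```python
import math
from collections import Counter


def readme_relevance(readme_a: str, readme_b: str) -> float:
    """Cosine similarity of raw term-frequency vectors (assumed stand-in)."""
    tf_a = Counter(readme_a.lower().split())
    tf_b = Counter(readme_b.lower().split())
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def stargazer_relevance(stars_a: dict, stars_b: dict) -> float:
    """Jaccard overlap of the users who starred each repository (assumed stand-in)."""
    users_a, users_b = set(stars_a), set(stars_b)
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0


def time_relevance(stars_a: dict, stars_b: dict) -> float:
    """Mean of 1 / (1 + gap in hours) over users who starred both (assumed stand-in)."""
    common = set(stars_a) & set(stars_b)
    if not common:
        return 0.0
    total = sum(
        1.0 / (1.0 + abs(stars_a[u] - stars_b[u]) / 3600.0) for u in common
    )
    return total / len(common)


def combined_relevance(readme_a, readme_b, stars_a, stars_b) -> float:
    """Unweighted average of the three scores (an assumed integration rule)."""
    scores = (
        readme_relevance(readme_a, readme_b),
        stargazer_relevance(stars_a, stars_b),
        time_relevance(stars_a, stars_b),
    )
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical data: user name -> time the user starred the repository.
    stars_a = {"alice": 0, "bob": 3600}
    stars_b = {"alice": 3600, "carol": 0}
    score = combined_relevance(
        "fast json parser for java", "streaming json parser", stars_a, stars_b
    )
    print(f"combined relevance: {score:.3f}")
```

Each score falls in the range 0 to 1, so the average does too. A ranking of candidate repositories could then be produced by sorting on the combined value, though the weighting actually used by the authors may differ.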
format text
author ZHANG, Yun
LO, David
KOCHHAR, Pavneet Singh
XIA, Xin
LI, Quanlai
SUN, Jianling
author_sort ZHANG, Yun
title Detecting similar repositories on GitHub
publisher Institutional Knowledge at Singapore Management University
publishDate 2017
url https://ink.library.smu.edu.sg/sis_research/3615
https://ink.library.smu.edu.sg/context/sis_research/article/4616/viewcontent/Detecting_Similar_Repositories_on_GitHub_2017_SANER.pdf
_version_ 1770573362416844800