NICHE: A curated dataset of engineered machine learning projects in Python

Bibliographic Details
Main Authors: WIDYASARI, Ratnadira, YANG, Zhou, THUNG, Ferdian, SIM, Sheng Qin, WEE, Fiona, LOK, Camellia, PHAN, Jack, QI, Haodi, TAN, Constance, LO, David
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
Online Access: https://ink.library.smu.edu.sg/sis_research/8570
https://ink.library.smu.edu.sg/context/sis_research/article/9573/viewcontent/niche.pdf
Institution: Singapore Management University
Description
Summary: Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts to filter those projects and curate ML projects of high quality. The limited availability of such a high-quality dataset poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. This dataset can help researchers understand the practices that are adopted in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.