NICHE: A curated dataset of engineered machine learning projects in Python

Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts in filtering those projects to curate ML projects of high quality. The lim...

Bibliographic Details
Main Authors: WIDYASARI, Ratnadira; YANG, Zhou; THUNG, Ferdian; SIM, Sheng Qin; WEE, Fiona; LOK, Camellia; PHAN, Jack; QI, Haodi; TAN, Constance; LO, David
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2023
Subjects:
Online Access: https://ink.library.smu.edu.sg/sis_research/8570
https://ink.library.smu.edu.sg/context/sis_research/article/9573/viewcontent/niche.pdf
Institution: Singapore Management University
id sg-smu-ink.sis_research-9573
record_format dspace
spelling sg-smu-ink.sis_research-95732024-01-25T09:00:10Z NICHE: A curated dataset of engineered machine learning projects in Python WIDYASARI, Ratnadira YANG, Zhou THUNG, Ferdian SIM, Sheng Qin WEE, Fiona LOK, Camellia PHAN, Jack QI, Haodi TAN, Constance LO, David David LO, Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts in filtering those projects to curate ML projects of high quality. The limited availability of such a high-quality dataset poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on the evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. This dataset can help researchers understand the practices that are adopted in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects. 2023-05-01T07:00:00Z text application/pdf https://ink.library.smu.edu.sg/sis_research/8570 info:doi/10.1109/MSR59073.2023.00022 https://ink.library.smu.edu.sg/context/sis_research/article/9573/viewcontent/niche.pdf http://creativecommons.org/licenses/by-nc-nd/4.0/ Research Collection School Of Computing and Information Systems eng Institutional Knowledge at Singapore Management University Daily lives Engineered software project High quality Labeled dataset Learning projects Machine-learning Open source platforms Open source projects Software engineering practices Software project Computer and Systems Architecture Databases and Information Systems Software Engineering
institution Singapore Management University
building SMU Libraries
continent Asia
country Singapore
content_provider SMU Libraries
collection InK@SMU
language English
topic Daily lives
Engineered software project
High quality
Labeled dataset
Learning projects
Machine-learning
Open source platforms
Open source projects
Software engineering practices
Software project
Computer and Systems Architecture
Databases and Information Systems
Software Engineering
spellingShingle Daily lives
Engineered software project
High quality
Labeled dataset
Learning projects
Machine-learning
Open source platforms
Open source projects
Software engineering practices
Software project
Computer and Systems Architecture
Databases and Information Systems
Software Engineering
WIDYASARI, Ratnadira
YANG, Zhou
THUNG, Ferdian
SIM, Sheng Qin
WEE, Fiona
LOK, Camellia
PHAN, Jack
QI, Haodi
TAN, Constance
LO, David
NICHE: A curated dataset of engineered machine learning projects in Python
description Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts in filtering those projects to curate ML projects of high quality. The limited availability of such a high-quality dataset poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on the evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. This dataset can help researchers understand the practices that are adopted in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.
format text
author WIDYASARI, Ratnadira
YANG, Zhou
THUNG, Ferdian
SIM, Sheng Qin
WEE, Fiona
LOK, Camellia
PHAN, Jack
QI, Haodi
TAN, Constance
LO, David
author_facet WIDYASARI, Ratnadira
YANG, Zhou
THUNG, Ferdian
SIM, Sheng Qin
WEE, Fiona
LOK, Camellia
PHAN, Jack
QI, Haodi
TAN, Constance
LO, David
author_sort WIDYASARI, Ratnadira
title NICHE: A curated dataset of engineered machine learning projects in Python
title_short NICHE: A curated dataset of engineered machine learning projects in Python
title_full NICHE: A curated dataset of engineered machine learning projects in Python
title_fullStr NICHE: A curated dataset of engineered machine learning projects in Python
title_full_unstemmed NICHE: A curated dataset of engineered machine learning projects in Python
title_sort niche: a curated dataset of engineered machine learning projects in python
publisher Institutional Knowledge at Singapore Management University
publishDate 2023
url https://ink.library.smu.edu.sg/sis_research/8570
https://ink.library.smu.edu.sg/context/sis_research/article/9573/viewcontent/niche.pdf
_version_ 1789483278086963200