NICHE: A curated dataset of engineered machine learning projects in Python
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts in filtering those projects to curate ML projects of high quality. The lim...
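Main Authors: | WIDYASARI, Ratnadira; YANG, Zhou; THUNG, Ferdian; SIM, Sheng Qin; WEE, Fiona; LOK, Camellia; PHAN, Jack; QI, Haodi; TAN, Constance; LO, David |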
---|---|
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University 2023 |
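Subjects: | Daily lives; Engineered software project; High quality; Labeled dataset; Learning projects; Machine-learning; Open source platforms; Open source projects; Software engineering practices; Software project; Computer and Systems Architecture; Databases and Information Systems; Software Engineering |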
Online Access: | https://ink.library.smu.edu.sg/sis_research/8570 https://ink.library.smu.edu.sg/context/sis_research/article/9573/viewcontent/niche.pdf |
Institution: | Singapore Management University |
id |
sg-smu-ink.sis_research-9573 |
---|---|
record_format |
dspace |
spelling |
sg-smu-ink.sis_research-9573; 2024-01-25T09:00:10Z; NICHE: A curated dataset of engineered machine learning projects in Python; WIDYASARI, Ratnadira; YANG, Zhou; THUNG, Ferdian; SIM, Sheng Qin; WEE, Fiona; LOK, Camellia; PHAN, Jack; QI, Haodi; TAN, Constance; LO, David; Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts in filtering those projects to curate ML projects of high quality. The limited availability of such a high-quality dataset poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on the evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. This dataset can help researchers understand the practices that are adopted in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.; 2023-05-01T07:00:00Z; text; application/pdf; https://ink.library.smu.edu.sg/sis_research/8570; info:doi/10.1109/MSR59073.2023.00022; https://ink.library.smu.edu.sg/context/sis_research/article/9573/viewcontent/niche.pdf; http://creativecommons.org/licenses/by-nc-nd/4.0/; Research Collection School Of Computing and Information Systems; eng; Institutional Knowledge at Singapore Management University; Daily lives; Engineered software project; High quality; Labeled dataset; Learning projects; Machine-learning; Open source platforms; Open source projects; Software engineering practices; Software project; Computer and Systems Architecture; Databases and Information Systems; Software Engineering |
institution |
Singapore Management University |
building |
SMU Libraries |
continent |
Asia |
country |
Singapore |
content_provider |
SMU Libraries |
collection |
InK@SMU |
language |
English |
topic |
Daily lives; Engineered software project; High quality; Labeled dataset; Learning projects; Machine-learning; Open source platforms; Open source projects; Software engineering practices; Software project; Computer and Systems Architecture; Databases and Information Systems; Software Engineering |
description |
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts in filtering those projects to curate ML projects of high quality. The limited availability of such a high-quality dataset poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on the evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. This dataset can help researchers understand the practices that are adopted in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects. |
format |
text |
author |
WIDYASARI, Ratnadira; YANG, Zhou; THUNG, Ferdian; SIM, Sheng Qin; WEE, Fiona; LOK, Camellia; PHAN, Jack; QI, Haodi; TAN, Constance; LO, David |
title |
NICHE: A curated dataset of engineered machine learning projects in Python |
publisher |
Institutional Knowledge at Singapore Management University |
publishDate |
2023 |
url |
https://ink.library.smu.edu.sg/sis_research/8570 https://ink.library.smu.edu.sg/context/sis_research/article/9573/viewcontent/niche.pdf |