Doppelgänger spotting in biomedical gene expression data

Doppelgänger effects (DEs) occur when samples exhibit chance similarities such that, when they are split across training and validation sets, the performance of the trained machine learning (ML) model is inflated. This inflationary effect creates misleading confidence in the deployability of the model. Thus far, th...

Bibliographic Details
Main Authors: Wang, Li Rong, Choy, Xin Yun, Goh, Wilson Wen Bin
Other Authors: School of Computer Science and Engineering
Format: Article
Language: English
Published: 2023
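Subjects: Science::Biological sciences, Engineering::Computer science and engineering, Bioinformatics, Genomics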
Online Access: https://hdl.handle.net/10356/164208
Institution: Nanyang Technological University
id sg-ntu-dr.10356-164208
record_format dspace
spelling sg-ntu-dr.10356-164208
2023-02-28T17:12:42Z
Doppelgänger spotting in biomedical gene expression data
Wang, Li Rong
Choy, Xin Yun
Goh, Wilson Wen Bin
School of Computer Science and Engineering
School of Biological Sciences
Lee Kong Chian School of Medicine (LKCMedicine)
Centre for Biomedical Informatics, NTU
Science::Biological sciences
Engineering::Computer science and engineering
Bioinformatics
Genomics
Doppelgänger effects (DEs) occur when samples exhibit chance similarities such that, when they are split across training and validation sets, the performance of the trained machine learning (ML) model is inflated. This inflationary effect creates misleading confidence in the deployability of the model. Thus far, there are no tools for doppelgänger identification or standard practices to manage their confounding implications. We present doppelgangerIdentifier, a software suite for doppelgänger identification and verification. Applying doppelgangerIdentifier across a multitude of diseases and data types, we show the pervasive nature of DEs in biomedical gene expression data. We also provide guidelines toward proper doppelgänger identification by exploring the ramifications of lingering batch effects from batch imbalances on the sensitivity of our doppelgänger identification algorithm. We suggest doppelgänger verification as a useful procedure to establish baselines for model evaluation that may indicate whether feature selection and ML on the data set can yield meaningful insights.
National Research Foundation (NRF)
Published version
This research/project is supported by the National Research Foundation, Singapore under its Industry Alignment Fund – Pre-positioning (IAF-PP) Funding Initiative.
2023-01-09T08:19:28Z
2023-01-09T08:19:28Z
2022
Journal Article
Wang, L. R., Choy, X. Y. & Goh, W. W. B. (2022). Doppelgänger spotting in biomedical gene expression data. iScience, 25(8), 104788.
https://dx.doi.org/10.1016/j.isci.2022.104788
2589-0042
https://hdl.handle.net/10356/164208
10.1016/j.isci.2022.104788
35992056
2-s2.0-85135695461
8
25
104788
en
iScience
© 2022 The Author(s). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Science::Biological sciences
Engineering::Computer science and engineering
Bioinformatics
Genomics
spellingShingle Science::Biological sciences
Engineering::Computer science and engineering
Bioinformatics
Genomics
Wang, Li Rong
Choy, Xin Yun
Goh, Wilson Wen Bin
Doppelgänger spotting in biomedical gene expression data
description Doppelgänger effects (DEs) occur when samples exhibit chance similarities such that, when they are split across training and validation sets, the performance of the trained machine learning (ML) model is inflated. This inflationary effect creates misleading confidence in the deployability of the model. Thus far, there are no tools for doppelgänger identification or standard practices to manage their confounding implications. We present doppelgangerIdentifier, a software suite for doppelgänger identification and verification. Applying doppelgangerIdentifier across a multitude of diseases and data types, we show the pervasive nature of DEs in biomedical gene expression data. We also provide guidelines toward proper doppelgänger identification by exploring the ramifications of lingering batch effects from batch imbalances on the sensitivity of our doppelgänger identification algorithm. We suggest doppelgänger verification as a useful procedure to establish baselines for model evaluation that may indicate whether feature selection and ML on the data set can yield meaningful insights.
author2 School of Computer Science and Engineering
author_facet School of Computer Science and Engineering
Wang, Li Rong
Choy, Xin Yun
Goh, Wilson Wen Bin
format Article
author Wang, Li Rong
Choy, Xin Yun
Goh, Wilson Wen Bin
author_sort Wang, Li Rong
title Doppelgänger spotting in biomedical gene expression data
title_short Doppelgänger spotting in biomedical gene expression data
title_full Doppelgänger spotting in biomedical gene expression data
title_fullStr Doppelgänger spotting in biomedical gene expression data
title_full_unstemmed Doppelgänger spotting in biomedical gene expression data
title_sort doppelgänger spotting in biomedical gene expression data
publishDate 2023
url https://hdl.handle.net/10356/164208
_version_ 1759857805776388096