How doppelgänger effects in biomedical data confound machine learning

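Machine learning (ML) models have been increasingly adopted in drug development for faster identification of potential targets. Cross-validation techniques are commonly used to evaluate these models. However, the reliability of such validation methods can be affected by the presence of data doppelgängers. Data doppelgängers occur when independently derived data are very similar to each other, causing models to perform well regardless of how they are trained (i.e., the doppelgänger effect). Despite the abundance of data doppelgängers in biomedical data and their inflationary effects, they remain uncharacterized. We show their prevalence in biomedical data, demonstrate how doppelgängers arise, and provide proof of their confounding effects. To mitigate the doppelgänger effect, we recommend identifying data doppelgängers before the training-validation split.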
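
The abstract's closing recommendation, identifying data doppelgängers before the training-validation split, might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' method: the record does not say how similarity is measured, so pairwise Pearson correlation and the 0.95 threshold are assumptions, and the function names `pearson`, `doppelganger_groups`, and `grouped_split` are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def doppelganger_groups(samples, threshold=0.95):
    """Group sample indices whose pairwise correlation exceeds `threshold`.

    Assumption: high Pearson correlation marks a suspected doppelgänger pair.
    Union-find merges chains of similar samples into one group.
    """
    parent = list(range(len(samples)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            if pearson(samples[i], samples[j]) > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(samples)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def grouped_split(samples, validation_fraction=0.3, threshold=0.95, seed=0):
    """Split sample indices so that no suspected group straddles the split."""
    groups = doppelganger_groups(samples, threshold)
    random.Random(seed).shuffle(groups)
    target = validation_fraction * len(samples)
    train, validation = [], []
    for group in groups:
        # Whole groups are assigned to one side, never divided.
        (validation if len(validation) < target else train).extend(group)
    return sorted(train), sorted(validation)
```

Calling `grouped_split(samples)` on a list of equal-length feature vectors returns two index lists. Because whole groups are assigned to one side, the validation set can end up larger than `validation_fraction` suggests when groups are big.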

Bibliographic Details
Main Authors: Wang, Li Rong; Wong, Limsoon; Goh, Wilson Wen Bin
Other Authors: Lee Kong Chian School of Medicine (LKCMedicine)
Format: Article
Language: English
Published: 2022
Subjects: Engineering::Computer science and engineering; Computational Biology; Data Science
Online Access: https://hdl.handle.net/10356/155991
Institution: Nanyang Technological University
id sg-ntu-dr.10356-155991
record_format dspace
spelling sg-ntu-dr.10356-155991 2023-02-28T17:10:53Z
Departments: Lee Kong Chian School of Medicine (LKCMedicine); School of Computer Science and Engineering; School of Biological Sciences
Funders: Ministry of Education (MOE); National Research Foundation (NRF)
Version: Submitted/Accepted version
Funding: This research/project is supported by the National Research Foundation, Singapore under its Industry Alignment Fund – Prepositioning (IAF-PP) Funding Initiative. W.W.B.G. also acknowledges support from a Ministry of Education (MOE), Singapore Tier 1 grant (Grant No. RG35/20).
Deposited: 2022-03-30T01:52:31Z
Type: Journal Article
Citation: Wang, L. R., Wong, L. & Goh, W. W. B. (2022). How doppelgänger effects in biomedical data confound machine learning. Drug Discovery Today, 27(3), 678-685. https://dx.doi.org/10.1016/j.drudis.2021.10.017
ISSN: 1359-6446
DOI: 10.1016/j.drudis.2021.10.017
PMID: 34743902
Scopus EID: 2-s2.0-85118879305
Rights: © 2021 Elsevier Ltd. All rights reserved. This paper was published in Drug Discovery Today and is made available with permission of Elsevier Ltd.
File format: application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
Computational Biology
Data Science
description Machine learning (ML) models have been increasingly adopted in drug development for faster identification of potential targets. Cross-validation techniques are commonly used to evaluate these models. However, the reliability of such validation methods can be affected by the presence of data doppelgängers. Data doppelgängers occur when independently derived data are very similar to each other, causing models to perform well regardless of how they are trained (i.e., the doppelgänger effect). Despite the abundance of data doppelgängers in biomedical data and their inflationary effects, they remain uncharacterized. We show their prevalence in biomedical data, demonstrate how doppelgängers arise, and provide proof of their confounding effects. To mitigate the doppelgänger effect, we recommend identifying data doppelgängers before the training-validation split.
author2 Lee Kong Chian School of Medicine (LKCMedicine)
format Article
author Wang, Li Rong
Wong, Limsoon
Goh, Wilson Wen Bin
author_sort Wang, Li Rong
title How doppelgänger effects in biomedical data confound machine learning
publishDate 2022
url https://hdl.handle.net/10356/155991
_version_ 1759855413815148544