How doppelgänger effects in biomedical data confound machine learning
Machine learning (ML) models have been increasingly adopted in drug development for faster identification of potential targets. Cross-validation techniques are commonly used to evaluate these models. However, the reliability of such validation methods can be affected by the presence of data doppelgä...
Main Authors: | Wang, Li Rong; Wong, Limsoon; Goh, Wilson Wen Bin |
---|---|
Other Authors: | Lee Kong Chian School of Medicine (LKCMedicine) |
Format: | Article |
Language: | English |
Published: | 2022 |
Subjects: | Engineering::Computer science and engineering; Computational Biology; Data Science |
Online Access: | https://hdl.handle.net/10356/155991 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-155991 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-1559912023-02-28T17:10:53Z How doppelgänger effects in biomedical data confound machine learning Wang, Li Rong Wong, Limsoon Goh, Wilson Wen Bin Lee Kong Chian School of Medicine (LKCMedicine) School of Computer Science and Engineering School of Biological Sciences Engineering::Computer science and engineering Computational Biology Data Science Machine learning (ML) models have been increasingly adopted in drug development for faster identification of potential targets. Cross-validation techniques are commonly used to evaluate these models. However, the reliability of such validation methods can be affected by the presence of data doppelgängers. Data doppelgängers occur when independently derived data are very similar to each other, causing models to perform well regardless of how they are trained (i.e., the doppelgänger effect). Despite the abundance of data doppelgängers in biomedical data and their inflationary effects, they remain uncharacterized. We show their prevalence in biomedical data, demonstrate how doppelgängers arise, and provide proof of their confounding effects. To mitigate the doppelgänger effect, we recommend identifying data doppelgängers before the training-validation split. Ministry of Education (MOE) National Research Foundation (NRF) Submitted/Accepted version This research/project is supported by the National Research Foundation, Singapore under its Industry Alignment Fund – Prepositioning (IAF-PP) Funding Initiative. W.W.B.G. also acknowledges support from a Ministry of Education (MOE), Singapore Tier 1 grant (Grant No. RG35/20). 2022-03-30T01:52:31Z 2022-03-30T01:52:31Z 2022 Journal Article Wang, L. R., Wong, L. & Goh, W. W. B. (2022). How doppelgänger effects in biomedical data confound machine learning. Drug Discovery Today, 27(3), 678-685. 
https://dx.doi.org/10.1016/j.drudis.2021.10.017 1359-6446 https://hdl.handle.net/10356/155991 10.1016/j.drudis.2021.10.017 34743902 2-s2.0-85118879305 3 27 678 685 en RG35/20 Drug Discovery Today © 2021 Elsevier Ltd. All rights reserved. This paper was published in Drug Discovery Today and is made available with permission of Elsevier Ltd. application/pdf application/pdf |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering Computational Biology Data Science |
spellingShingle |
Engineering::Computer science and engineering Computational Biology Data Science Wang, Li Rong Wong, Limsoon Goh, Wilson Wen Bin How doppelgänger effects in biomedical data confound machine learning |
description |
Machine learning (ML) models have been increasingly adopted in drug development for faster identification of potential targets. Cross-validation techniques are commonly used to evaluate these models. However, the reliability of such validation methods can be affected by the presence of data doppelgängers. Data doppelgängers occur when independently derived data are very similar to each other, causing models to perform well regardless of how they are trained (i.e., the doppelgänger effect). Despite the abundance of data doppelgängers in biomedical data and their inflationary effects, they remain uncharacterized. We show their prevalence in biomedical data, demonstrate how doppelgängers arise, and provide proof of their confounding effects. To mitigate the doppelgänger effect, we recommend identifying data doppelgängers before the training-validation split. |
author2 |
Lee Kong Chian School of Medicine (LKCMedicine) |
author_facet |
Lee Kong Chian School of Medicine (LKCMedicine) Wang, Li Rong Wong, Limsoon Goh, Wilson Wen Bin |
format |
Article |
author |
Wang, Li Rong Wong, Limsoon Goh, Wilson Wen Bin |
author_sort |
Wang, Li Rong |
title |
How doppelgänger effects in biomedical data confound machine learning |
title_short |
How doppelgänger effects in biomedical data confound machine learning |
title_full |
How doppelgänger effects in biomedical data confound machine learning |
title_fullStr |
How doppelgänger effects in biomedical data confound machine learning |
title_full_unstemmed |
How doppelgänger effects in biomedical data confound machine learning |
title_sort |
how doppelgänger effects in biomedical data confound machine learning |
publishDate |
2022 |
url |
https://hdl.handle.net/10356/155991 |
_version_ |
1759855413815148544 |
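
The description above recommends identifying data doppelgängers before the training-validation split. This record does not say how the authors identify them, so the following is only a minimal sketch under stated assumptions: similarity is measured with the pairwise Pearson correlation between sample profiles, a pair counts as a doppelgänger pair above an arbitrary cutoff of 0.95, and linked samples are kept on the same side of the split. The function names (`find_doppelganger_pairs`, `group_doppelgangers`, `split_by_group`), the cutoff, and the synthetic data are all hypothetical and chosen for illustration.

```python
import numpy as np


def find_doppelganger_pairs(X, threshold=0.95):
    """Return index pairs (i, j), i < j, whose sample profiles correlate above threshold.

    Assumption: Pearson correlation and the 0.95 cutoff are illustrative choices,
    not values taken from the catalogued article.
    """
    corr = np.corrcoef(X)  # rows of X are samples, so corr is sample-by-sample
    i_idx, j_idx = np.triu_indices(corr.shape[0], k=1)
    mask = corr[i_idx, j_idx] > threshold
    return list(zip(i_idx[mask].tolist(), j_idx[mask].tolist()))


def group_doppelgangers(n_samples, pairs):
    """Union-find: samples linked by any flagged pair receive the same group label."""
    parent = list(range(n_samples))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)
    return [find(i) for i in range(n_samples)]


def split_by_group(groups, val_fraction=0.25, seed=0):
    """Hold out whole groups until roughly val_fraction of the samples are in validation."""
    rng = np.random.default_rng(seed)
    unique = np.array(sorted(set(groups)))
    rng.shuffle(unique)
    target = val_fraction * len(groups)
    val_groups, count = set(), 0
    for g in unique:
        if count >= target:
            break
        val_groups.add(int(g))
        count += groups.count(int(g))
    train = [i for i, g in enumerate(groups) if g not in val_groups]
    val = [i for i, g in enumerate(groups) if g in val_groups]
    return train, val


# Synthetic example: six samples with 50 features each.
rng = np.random.default_rng(42)
X = rng.normal(size=(6, 50))
# Make sample 1 a near-copy of sample 0, and sample 3 a near-copy of sample 2.
X[1] = X[0] + 0.01 * rng.normal(size=50)
X[3] = X[2] + 0.01 * rng.normal(size=50)

pairs = find_doppelganger_pairs(X, threshold=0.95)
groups = group_doppelgangers(len(X), pairs)
train_idx, val_idx = split_by_group(groups)
print("flagged pairs:", pairs)
print("train:", train_idx, "validation:", val_idx)
```

Grouping before splitting means a flagged pair can never straddle the training and validation sets, which is the leakage the description warns about. Whether linked samples should be kept together, removed, or all placed in one partition is a design choice this sketch does not settle.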