How doppelgänger effects in biomedical data confound machine learning

Bibliographic Details
Main Authors: Wang, Li Rong; Wong, Limsoon; Goh, Wilson Wen Bin
Other Authors: Lee Kong Chian School of Medicine (LKCMedicine)
Format: Article
Language: English
Published: 2022
Online Access: https://hdl.handle.net/10356/155991
Description
Summary: Machine learning (ML) models have been increasingly adopted in drug development for faster identification of potential targets. Cross-validation techniques are commonly used to evaluate these models. However, the reliability of such validation methods can be affected by the presence of data doppelgängers. Data doppelgängers occur when independently derived data are very similar to each other, causing models to perform well regardless of how they are trained (i.e., the doppelgänger effect). Despite the abundance of data doppelgängers in biomedical data and their inflationary effects, they remain uncharacterized. We show their prevalence in biomedical data, demonstrate how doppelgängers arise, and provide proof of their confounding effects. To mitigate the doppelgänger effect, we recommend identifying data doppelgängers before the training-validation split.
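
The recommendation in the summary, identifying data doppelgängers before the training-validation split, can be illustrated with a minimal sketch. This record does not say how doppelgängers are detected, so the similarity criterion here (pairwise Pearson correlation between samples above a fixed threshold) is an assumption, as are the threshold value, the synthetic data, and the hypothetical function names `find_doppelganger_groups` and `group_aware_split`. The idea shown is only that samples flagged as near-duplicates are grouped and then kept on the same side of the split.

```python
import numpy as np


def find_doppelganger_groups(X, threshold=0.9):
    """Label samples so that highly correlated samples share a group.

    X has shape (n_samples, n_features). Two samples are linked when their
    Pearson correlation exceeds `threshold` (an assumed criterion); linked
    samples are merged transitively with a small union-find.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    corr = np.corrcoef(X)  # rows are treated as variables, i.e. samples
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if corr[i, j] > threshold:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    _, labels = np.unique(roots, return_inverse=True)
    return labels


def group_aware_split(labels, test_fraction=0.25, seed=0):
    """Split sample indices so that no group straddles train and test."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    shuffled_groups = rng.permutation(np.unique(labels))
    target = int(round(test_fraction * len(labels)))

    test_groups, n_test = [], 0
    for g in shuffled_groups:
        if n_test >= target:
            break
        test_groups.append(g)
        n_test += int(np.sum(labels == g))

    test_mask = np.isin(labels, test_groups)
    return np.where(~test_mask)[0], np.where(test_mask)[0]


# Synthetic illustration: sample 4 is a noisy near-copy of sample 0.
rng = np.random.default_rng(42)
base = rng.normal(size=(4, 50))
X = np.vstack([base, base[0] + 0.01 * rng.normal(size=50)])

labels = find_doppelganger_groups(X, threshold=0.9)
train_idx, test_idx = group_aware_split(labels)
```

In practice the group labels could also be passed to an existing group-aware splitter, such as scikit-learn's `GroupKFold`, instead of the hand-written split above.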