Can normalization methods allow escape from the doppelgänger effect in biomedical data?

The Doppelganger Effect (DE) describes the situation when an AI/ML model performs well on a validation set regardless of whether it has truly learned. DE may exaggerate the reported performance of the AI/ML model on real-world data, complicate model selection processes and lead towards false domain...

Bibliographic Details
Main Author:	Guo, Zexi
Other Authors:	Goh Wen Bin Wilson
Format:	Thesis-Master by Research
Language:	English
Published:	Nanyang Technological University 2023
Subjects:	Science::Biological sciences
Online Access:	https://hdl.handle.net/10356/165285
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-165285
record_format	dspace
spelling	sg-ntu-dr.10356-1652852023-04-04T02:58:00Z Can normalization methods allow escape from the doppelgänger effect in biomedical data? Guo, Zexi Goh Wen Bin Wilson School of Biological Sciences wilsongoh@ntu.edu.sg Science::Biological sciences The Doppelganger Effect (DE) describes the situation when an AI/ML model performs well on a validation set regardless of whether it has truly learned. DE may exaggerate the reported performance of the AI/ML model on real-world data, complicate model selection processes and lead towards false domain explanations. Here, we explore interactions between data normalization and DE. Although each normalization method produces different data distributions, they ultimately preserve rank orderings within each sample. It turns out that rank information alone is sufficient to induce high mutual correlations between samples. The only exception is the Gene Fuzzy Scoring (GFS) approach which impacts both scale and rank. Although GFS reduces mutual correlations, it does not provide an escape from DE, leading us to suspect that current approaches of identifying Data Doppelganger lack sensitivity. Contrary to previous reports, we find that GFS has reduced feature selection stability. However, GFS produces highly stable ML models which are also phenotypically relevant. We believe that combining GFS with current doppelganger mitigation measures may be a compelling synergistic approach towards biomedical data modeling. Master of Science 2023-03-23T01:29:01Z 2023-03-23T01:29:01Z 2023 Thesis-Master by Research Guo, Z. (2023). Can normalization methods allow escape from the doppelgänger effect in biomedical data?. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/165285 https://hdl.handle.net/10356/165285 10.32657/10356/165285 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Science::Biological sciences
spellingShingle	Science::Biological sciences Guo, Zexi Can normalization methods allow escape from the doppelgänger effect in biomedical data?
description	The Doppelgänger Effect (DE) describes the situation in which an AI/ML model performs well on a validation set regardless of whether it has truly learned. DE may exaggerate the reported performance of the AI/ML model on real-world data, complicate model selection, and lead to false domain explanations. Here, we explore interactions between data normalization and DE. Although the normalization methods produce different data distributions, they ultimately preserve rank orderings within each sample. It turns out that rank information alone is sufficient to induce high mutual correlations between samples. The only exception is the Gene Fuzzy Scoring (GFS) approach, which alters both scale and rank. Although GFS reduces mutual correlations, it does not provide an escape from DE, leading us to suspect that current approaches to identifying data doppelgängers lack sensitivity. Contrary to previous reports, we find that GFS has reduced feature selection stability. However, GFS produces highly stable ML models that are also phenotypically relevant. We believe that combining GFS with current doppelgänger mitigation measures may be a compelling synergistic approach to biomedical data modeling.
author2	Goh Wen Bin Wilson
author_facet	Goh Wen Bin Wilson Guo, Zexi
format	Thesis-Master by Research
author	Guo, Zexi
author_sort	Guo, Zexi
title	Can normalization methods allow escape from the doppelgänger effect in biomedical data?
publisher	Nanyang Technological University
publishDate	2023
url	https://hdl.handle.net/10356/165285
_version_	1764208044495863808
