Can normalization methods allow escape from the doppelgänger effect in biomedical data?

The Doppelganger Effect (DE) describes the situation when an AI/ML model performs well on a validation set regardless of whether it has truly learned. DE may exaggerate the reported performance of the AI/ML model on real-world data, complicate model selection processes and lead towards false domain...

Bibliographic Details
Main Author: Guo, Zexi
Other Authors: Goh Wen Bin Wilson
Format: Thesis-Master by Research
Language: English
Published: Nanyang Technological University 2023
Subjects: Science::Biological sciences
Online Access: https://hdl.handle.net/10356/165285
Institution: Nanyang Technological University
id sg-ntu-dr.10356-165285
record_format dspace
spelling sg-ntu-dr.10356-165285 2023-04-04T02:58:00Z
Guo, Zexi
Goh Wen Bin Wilson
School of Biological Sciences
wilsongoh@ntu.edu.sg
Science::Biological sciences
Master of Science
2023-03-23T01:29:01Z
2023
Thesis-Master by Research
Guo, Z. (2023). Can normalization methods allow escape from the doppelgänger effect in biomedical data? Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/165285
10.32657/10356/165285
en
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
application/pdf
Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Science::Biological sciences
description The doppelgänger effect (DE) describes the situation in which an AI/ML model performs well on a validation set regardless of whether it has truly learned. DE may exaggerate the reported performance of an AI/ML model on real-world data, complicate model selection, and lead to false domain explanations. Here, we explore interactions between data normalization and DE. Although each normalization method produces a different data distribution, all of them ultimately preserve rank orderings within each sample. It turns out that rank information alone is sufficient to induce high mutual correlations between samples. The only exception is the Gene Fuzzy Scoring (GFS) approach, which affects both scale and rank. Although GFS reduces mutual correlations, it does not provide an escape from DE, leading us to suspect that current approaches for identifying data doppelgängers lack sensitivity. Contrary to previous reports, we find that GFS has lower feature selection stability. However, GFS produces highly stable ML models that are also phenotypically relevant. We believe that combining GFS with current doppelgänger mitigation measures may be a compelling synergistic approach to biomedical data modeling.
author2 Goh Wen Bin Wilson
format Thesis-Master by Research
author Guo, Zexi
author_sort Guo, Zexi
title Can normalization methods allow escape from the doppelgänger effect in biomedical data?
publisher Nanyang Technological University
publishDate 2023
url https://hdl.handle.net/10356/165285