Can normalization methods allow escape from the doppelgänger effect in biomedical data?

Bibliographic Details
Main Author:	Guo, Zexi
Other Authors:	Goh Wen Bin Wilson
Format:	Thesis-Master by Research
Language:	English
Published:	Nanyang Technological University 2023
Subjects:	Science::Biological sciences
Online Access:	https://hdl.handle.net/10356/165285

Description
Summary:	The doppelgänger effect (DE) describes the situation in which an AI/ML model performs well on a validation set regardless of whether it has truly learned. DE may exaggerate the reported performance of an AI/ML model on real-world data, complicate model selection, and lead to false domain explanations. Here, we explore interactions between data normalization and DE. Although normalization methods produce different data distributions, they ultimately preserve rank orderings within each sample, and rank information alone turns out to be sufficient to induce high mutual correlations between samples. The only exception is the Gene Fuzzy Scoring (GFS) approach, which alters both scale and rank. Although GFS reduces mutual correlations, it does not provide an escape from DE, leading us to suspect that current approaches for identifying data doppelgängers lack sensitivity. Contrary to previous reports, we find that GFS yields reduced feature selection stability. However, GFS produces highly stable ML models that are also phenotypically relevant. We believe that combining GFS with current doppelgänger mitigation measures may be a compelling synergistic approach to biomedical data modeling.
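
The point about rank preservation can be illustrated with a small sketch. It is not taken from the thesis: the simulated expression values, the two example normalizations (z-scoring and a log2 transform), and the use of Spearman correlation as the between-sample measure are all assumptions made for illustration. It shows only that a strictly increasing per-sample transform leaves within-sample ranks, and therefore any rank-based correlation between samples, unchanged.

```python
import math
import random


def ranks(values):
    """Return the 0-based rank of each value within its sample (ties not handled)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def zscore(sample):
    """Per-sample z-score normalization (a strictly increasing transform)."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in sample) / n)
    return [(v - mean) / sd for v in sample]


def log_transform(sample):
    """Per-sample log2(x + 1) transform (also strictly increasing)."""
    return [math.log2(v + 1.0) for v in sample]


# Hypothetical data: two samples sharing a per-gene baseline plus independent noise.
rng = random.Random(0)
gene_baseline = [rng.lognormvariate(5, 1.5) for _ in range(200)]
sample_a = [g * rng.lognormvariate(0, 0.3) for g in gene_baseline]
sample_b = [g * rng.lognormvariate(0, 0.3) for g in gene_baseline]

raw_rho = spearman(sample_a, sample_b)
z_rho = spearman(zscore(sample_a), zscore(sample_b))
log_rho = spearman(log_transform(sample_a), log_transform(sample_b))

print(f"raw: {raw_rho:.4f}  z-score: {z_rho:.4f}  log2: {log_rho:.4f}")
```

Because both transforms keep the ordering of values within a sample, the three printed correlations coincide. A method that changes ranks, as the summary says GFS does, would not be expected to share this property.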
