Can normalization methods allow escape from the doppelgänger effect in biomedical data?
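The Doppelganger Effect (DE) describes the situation when an AI/ML model performs well on a validation set regardless of whether it has truly learned. DE may exaggerate the reported performance of the AI/ML model on real-world data, complicate model selection processes and lead towards false domain explanations. Here, we explore interactions between data normalization and DE. Although each normalization method produces different data distributions, they ultimately preserve rank orderings within each sample. It turns out that rank information alone is sufficient to induce high mutual correlations between samples. The only exception is the Gene Fuzzy Scoring (GFS) approach which impacts both scale and rank. Although GFS reduces mutual correlations, it does not provide an escape from DE, leading us to suspect that current approaches of identifying Data Doppelganger lack sensitivity. Contrary to previous reports, we find that GFS has reduced feature selection stability. However, GFS produces highly stable ML models which are also phenotypically relevant. We believe that combining GFS with current doppelganger mitigation measures may be a compelling synergistic approach towards biomedical data modeling.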
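
The thesis abstract states that normalization methods which preserve within-sample rank orderings leave rank-based correlations between samples intact. The following is a minimal sketch of that idea on simulated data, not the thesis's own pipeline. The data are synthetic, all function names are hypothetical, and z-scoring, min-max scaling and a log2 transform are used only as familiar examples of rank-preserving transforms; they are not necessarily the methods the thesis evaluated.

```python
import math
import random


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Three per-sample transforms that are strictly increasing,
# so they change the scale but not the ordering of values.
def z_score(sample):
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in sample) / n)
    return [(v - mean) / sd for v in sample]


def min_max(sample):
    lo, hi = min(sample), max(sample)
    return [(v - lo) / (hi - lo) for v in sample]


def log2_transform(sample):
    return [math.log2(v + 1.0) for v in sample]


# Simulated data (an assumption): two samples that share a per-gene
# baseline and differ only by independent multiplicative noise.
random.seed(0)
n_genes = 500
baseline = [random.lognormvariate(3.0, 1.0) for _ in range(n_genes)]
sample_a = [b * random.lognormvariate(0.0, 0.3) for b in baseline]
sample_b = [b * random.lognormvariate(0.0, 0.3) for b in baseline]

raw_rho = spearman(sample_a, sample_b)
normalized_rho = {
    f.__name__: spearman(f(sample_a), f(sample_b))
    for f in (z_score, min_max, log2_transform)
}

print(f"raw: {raw_rho:.4f}")
for name, rho in normalized_rho.items():
    print(f"{name}: {rho:.4f}")
```

Because each transform is strictly increasing within a sample, the ranks are unchanged and the Spearman correlation between the two simulated samples is the same before and after normalization. A method that alters ranks, as the abstract says GFS does, would not share this property.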
Main Author: | Guo, Zexi |
---|---|
Other Authors: | Goh Wen Bin Wilson |
Format: | Thesis-Master by Research |
Language: | English |
Published: | Nanyang Technological University, 2023 |
Subjects: | Science::Biological sciences |
Online Access: | https://hdl.handle.net/10356/165285 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-165285 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-165285; 2023-04-04T02:58:00Z; Can normalization methods allow escape from the doppelgänger effect in biomedical data?; Guo, Zexi; Goh Wen Bin Wilson; School of Biological Sciences; wilsongoh@ntu.edu.sg; Science::Biological sciences; The Doppelganger Effect (DE) describes the situation when an AI/ML model performs well on a validation set regardless of whether it has truly learned. DE may exaggerate the reported performance of the AI/ML model on real-world data, complicate model selection processes and lead towards false domain explanations. Here, we explore interactions between data normalization and DE. Although each normalization method produces different data distributions, they ultimately preserve rank orderings within each sample. It turns out that rank information alone is sufficient to induce high mutual correlations between samples. The only exception is the Gene Fuzzy Scoring (GFS) approach which impacts both scale and rank. Although GFS reduces mutual correlations, it does not provide an escape from DE, leading us to suspect that current approaches of identifying Data Doppelganger lack sensitivity. Contrary to previous reports, we find that GFS has reduced feature selection stability. However, GFS produces highly stable ML models which are also phenotypically relevant. We believe that combining GFS with current doppelganger mitigation measures may be a compelling synergistic approach towards biomedical data modeling.; Master of Science; 2023-03-23T01:29:01Z; 2023-03-23T01:29:01Z; 2023; Thesis-Master by Research; Guo, Z. (2023). Can normalization methods allow escape from the doppelgänger effect in biomedical data?. Master's thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/165285; https://hdl.handle.net/10356/165285; 10.32657/10356/165285; en; This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).; application/pdf; Nanyang Technological University |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Science::Biological sciences |
spellingShingle |
Science::Biological sciences; Guo, Zexi; Can normalization methods allow escape from the doppelgänger effect in biomedical data? |
description |
The Doppelganger Effect (DE) describes the situation when an AI/ML model performs well on a validation set regardless of whether it has truly learned. DE may exaggerate the reported performance of the AI/ML model on real-world data, complicate model selection processes and lead towards false domain explanations. Here, we explore interactions between data normalization and DE. Although each normalization method produces different data distributions, they ultimately preserve rank orderings within each sample. It turns out that rank information alone is sufficient to induce high mutual correlations between samples. The only exception is the Gene Fuzzy Scoring (GFS) approach which impacts both scale and rank. Although GFS reduces mutual correlations, it does not provide an escape from DE, leading us to suspect that current approaches of identifying Data Doppelganger lack sensitivity. Contrary to previous reports, we find that GFS has reduced feature selection stability. However, GFS produces highly stable ML models which are also phenotypically relevant. We believe that combining GFS with current doppelganger mitigation measures may be a compelling synergistic approach towards biomedical data modeling. |
author2 |
Goh Wen Bin Wilson |
author_facet |
Goh Wen Bin Wilson; Guo, Zexi |
format |
Thesis-Master by Research |
author |
Guo, Zexi |
author_sort |
Guo, Zexi |
title |
Can normalization methods allow escape from the doppelgänger effect in biomedical data? |
title_sort |
can normalization methods allow escape from the doppelgänger effect in biomedical data? |
publisher |
Nanyang Technological University |
publishDate |
2023 |
url |
https://hdl.handle.net/10356/165285 |
_version_ |
1764208044495863808 |