An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study

Machine learning (ML) is increasingly deployed on biomedical studies for biomarker development (feature selection) and diagnostic/prognostic technologies (classification). While different ML techniques produce different feature sets and classification performances, less understood is how upstream da...

Bibliographic Details
Main Authors:	Zhang, Xinxin; Lee, Jimmy; Goh, Wilson Wen Bin
Other Authors:	School of Biological Sciences
Format:	Article
Language:	English
Published:	2022
Subjects:	Science::Biological sciences; Biomarker; Data Normalisation
Online Access:	https://hdl.handle.net/10356/160994
Institution:	Nanyang Technological University

id	sg-ntu-dr.10356-160994
record_format	dspace
spelling	sg-ntu-dr.10356-1609942023-02-28T17:13:13Z An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study Zhang, Xinxin Lee, Jimmy Goh, Wilson Wen Bin School of Biological Sciences Lee Kong Chian School of Medicine (LKCMedicine) Centre for Biomedical Informatics Science::Biological sciences Biomarker Data Normalisation Machine learning (ML) is increasingly deployed in biomedical studies for biomarker development (feature selection) and diagnostic/prognostic technologies (classification). While different ML techniques produce different feature sets and classification performances, less understood is how upstream data processing methods (e.g., normalisation) impact downstream analyses. Using a clinical mental health dataset, we investigated the impact of different normalisation techniques on classification model performance. Gene Fuzzy Scoring (GFS), an in-house developed normalisation technique, is compared against widely used normalisation methods such as global quantile normalisation, class-specific quantile normalisation and surrogate variable analysis. We report that the choice of normalisation technique has a strong influence on feature selection, with GFS outperforming other techniques. Although GFS parameters are tuneable, good classification model performance (ROC-AUC > 0.90) is observed regardless of the GFS parameter settings. We also contrasted our results against local modelling, which is meant to improve the resolution and meaningfulness of classification models built on heterogeneous data. Local models, when derived from non-biologically meaningful subpopulations, perform worse than global models. A deep dive, however, revealed that the factors driving cluster formation have little to do with the phenotype-of-interest. This finding is critical, as local models are often seen as a superior means of clinical data modelling. We advise against such naivete. Additionally, we have developed a combinatorial reasoning approach using both global and local paradigms. This helped reveal potential data quality issues or underlying factors causing data heterogeneity that are often overlooked. It also helps explain the model and provides directions for further improvement. Ministry of Education (MOE) National Medical Research Council (NMRC) National Research Foundation (NRF) Published version This research/project was supported by the National Research Foundation, Singapore under its Industry Alignment Fund – Prepositioning (IAF-PP) Funding Initiative. This study was also supported by the National Research Foundation Singapore under the National Medical Research Council Translational and Clinical Research Flagship Programme (Grant No.: NMRC/TCR/003/2008) and a Ministry of Education (MOE), Singapore Tier 1 grant (Grant No. RG35/20). 2022-08-10T08:55:03Z 2022-08-10T08:55:03Z 2022 Journal Article Zhang, X., Lee, J. & Goh, W. W. B. (2022). An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study. Heliyon, 8(5), e09502-. https://dx.doi.org/10.1016/j.heliyon.2022.e09502 2405-8440 https://hdl.handle.net/10356/160994 10.1016/j.heliyon.2022.e09502 35663731 2-s2.0-85130838135 5 8 e09502 en NMRC/TCR/003/2008 RG35/20 Heliyon © 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). application/pdf
institution	Nanyang Technological University
building	NTU Library
continent	Asia
country	Singapore
content_provider	NTU Library
collection	DR-NTU
language	English
topic	Science::Biological sciences Biomarker Data Normalisation
spellingShingle	Science::Biological sciences Biomarker Data Normalisation Zhang, Xinxin Lee, Jimmy Goh, Wilson Wen Bin An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study
description	Machine learning (ML) is increasingly deployed in biomedical studies for biomarker development (feature selection) and diagnostic/prognostic technologies (classification). While different ML techniques produce different feature sets and classification performances, less understood is how upstream data processing methods (e.g., normalisation) impact downstream analyses. Using a clinical mental health dataset, we investigated the impact of different normalisation techniques on classification model performance. Gene Fuzzy Scoring (GFS), an in-house developed normalisation technique, is compared against widely used normalisation methods such as global quantile normalisation, class-specific quantile normalisation and surrogate variable analysis. We report that the choice of normalisation technique has a strong influence on feature selection, with GFS outperforming other techniques. Although GFS parameters are tuneable, good classification model performance (ROC-AUC > 0.90) is observed regardless of the GFS parameter settings. We also contrasted our results against local modelling, which is meant to improve the resolution and meaningfulness of classification models built on heterogeneous data. Local models, when derived from non-biologically meaningful subpopulations, perform worse than global models. A deep dive, however, revealed that the factors driving cluster formation have little to do with the phenotype-of-interest. This finding is critical, as local models are often seen as a superior means of clinical data modelling. We advise against such naivete. Additionally, we have developed a combinatorial reasoning approach using both global and local paradigms. This helped reveal potential data quality issues or underlying factors causing data heterogeneity that are often overlooked. It also helps explain the model and provides directions for further improvement.
author2	School of Biological Sciences
author_facet	School of Biological Sciences Zhang, Xinxin Lee, Jimmy Goh, Wilson Wen Bin
format	Article
author	Zhang, Xinxin Lee, Jimmy Goh, Wilson Wen Bin
author_sort	Zhang, Xinxin
title	An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study
title_short	An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study
title_full	An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study
title_fullStr	An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study
title_full_unstemmed	An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study
title_sort	investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study
publishDate	2022
url	https://hdl.handle.net/10356/160994
_version_	1759853806000013312

Similar Items