A clustering-based preprocessing method for the elimination of unwanted residuals in metabolomic data

Bibliographic Details
Main Authors: Wang, W., Cheng, K. K., Deng, L., Xu, J., Shen, G., Griffin, J. L., Dong, J.
Format: Article
Published: Springer New York LLC 2017
Online Access: http://eprints.utm.my/id/eprint/76957/
https://www.scopus.com/inward/record.uri?eid=2-s2.0-85006757598&doi=10.1007%2fs11306-016-1146-y&partnerID=40&md5=7c26f1e3daaa1340ae08c70be1798666
Institution: Universiti Teknologi Malaysia
Description
Summary: Introduction: The metabolome of a biological system is affected by multiple factors, including the factor of interest (e.g. metabolic perturbation due to disease) and unwanted factors, i.e. factors that are not the primary focus of the study (e.g. batch effect, gender, and level of physical activity). Removing this unwanted variation is advantageous because it may complicate biological interpretation of the data. Objectives: We aim to develop a new unwanted variations elimination (UVE) method, called clustering-based unwanted residuals elimination (CURE), to reduce metabolic variation caused by unwanted or hidden factors in metabolomic data. Methods: A mean-centered metabolomic dataset can be viewed as the sum of a studied factor matrix and a residual matrix. The CURE method assumes that the residual should be normally distributed if it contains only inter-individual variation. If, however, the residual forms multiple clusters in the feature subspace of principal component analysis or partial least squares discriminant analysis, it may contain variation due to unwanted factors. This unwanted variation is removed by applying K-means clustering to the residuals and subtracting each cluster's mean from them. The process is iterated until the residual no longer forms multiple clusters in the feature subspace. Results: Three simulated datasets and a human metabolomic dataset were used to demonstrate the performance of the proposed CURE method. CURE removed most of the variation caused by unwanted factors while preserving inter-individual variation between samples. Conclusion: The CURE method can effectively remove unwanted data variation and can serve as an alternative UVE method for metabolomic data.
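
The iterative procedure described under Methods can be illustrated with a rough sketch. This is not the authors' implementation; several details are assumptions made only for illustration: the studied factor matrix is taken to be the class means of a label vector `y`, the feature subspace is the first two PCA components, the number of clusters `k` is fixed, and the stopping rule (the share of score variance explained by the clusters dropping below a hypothetical `threshold`) stands in for whatever criterion the paper uses to decide that the residual no longer forms multiple clusters. The function name `cure_sketch` and all parameter names are likewise hypothetical.

```python
import numpy as np


def pca_scores(R, n_components=2):
    """PCA scores of a column-centered matrix R via SVD."""
    U, S, _ = np.linalg.svd(R, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


def kmeans(Z, k, n_iter=50, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per row of Z."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def cure_sketch(X, y, k=2, n_components=2, max_iter=10, threshold=0.5):
    """Illustrative residual clean-up; all defaults are assumptions."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X = X - X.mean(axis=0)               # mean-center the dataset

    # Assumed studied factor matrix: class means of the factor of interest.
    F = np.zeros_like(X)
    for g in np.unique(y):
        F[y == g] = X[y == g].mean(axis=0)
    R = X - F                            # residual matrix

    for _ in range(max_iter):
        Z = pca_scores(R, n_components)
        labels = kmeans(Z, k)

        # Assumed stopping rule: fraction of score variance that lies
        # between clusters. A low value is read as "no clear clusters".
        total = (Z ** 2).sum()
        between = sum(
            (labels == j).sum() * (Z[labels == j].mean(axis=0) ** 2).sum()
            for j in np.unique(labels)
        )
        if total == 0 or between / total < threshold:
            break

        # Subtract each cluster's mean from the residuals in that cluster.
        for j in np.unique(labels):
            R[labels == j] -= R[labels == j].mean(axis=0)

    return F + R                         # corrected, mean-centered data
```

Calling `cure_sketch(X, y)` on a samples-by-features array returns a corrected array of the same shape. The threshold-based stopping rule is deliberately crude; a formal clustering-tendency or normality test on the residual scores would be a closer match to the assumption stated in the abstract.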