Discovery of novel biomarkers using machine learning based methods in cross disorder psychiatry cohorts

Bibliographic Details
Main Author: Cao, Shuwen
Other Authors: Jagath C Rajapakse
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2023
Online Access: https://hdl.handle.net/10356/166555
Institution: Nanyang Technological University
Description
Summary: Psychiatric disorders (PD) are gaining more attention because of their profound negative impact on individuals and society. Genomic psychiatry is therefore also attracting growing interest, as it holds much promise for biomarker discovery in PD. However, genomic datasets collected in a psychiatric outpatient clinic setting usually consist of high-dimensional data with small sample sizes, which poses a major challenge for accurate and significant clinical analysis of transcriptomic data. In this project, we address this issue by proposing a pipeline of state-of-the-art machine learning methods that extracts a salient set of genes (the features of the genomic data) as potential biomarkers for future biological analysis. Using machine learning techniques, we aim to narrow down the number of genes to those potential biomarkers that have a significant impact on identifying bipolar disorder (BD). To better simulate a psychiatric outpatient clinic setting, we carried out the investigation on transcriptomic data from lithium-treated and non-lithium-treated bipolar patients (n=240) and healthy controls (n=240). After a series of data pre-processing steps, univariate filtering using the F-test was applied to the genomic data, followed by Principal Component Analysis (PCA) for dimensionality reduction. Lastly, we implemented multivariate feature selection by recursive feature elimination, using various machine learning models with nested cross-validation, to select the set of genes giving the best prediction accuracy in distinguishing BD patients from healthy controls. The results indicate that the genes selected by our proposed pipeline achieve higher predictive accuracy in classifying BD patients, and BD patients treated with lithium, against healthy controls.
We conclude that our proposed feature selection pipeline, which combines univariate filtering, PCA and multivariate feature selection with machine learning methods, is capable of overcoming the challenges posed by the high dimensionality of gene expression data and can select relevant salient features for further biological analysis.
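
The stages named in the summary (F-test filtering, PCA, recursive feature elimination, nested cross-validation) could be chained roughly as in the sketch below. This is only an illustration using scikit-learn, not the project's actual code: the synthetic data, the logistic regression model, the function names `build_pipeline` and `nested_cv_accuracy`, and every parameter value are assumptions. Note also that in this simple chaining the elimination step ranks principal components rather than individual genes; the record does not say how the project maps the selection back to genes.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline(n_filtered=200, n_components=20):
    """Chain the stages; all sizes here are illustrative assumptions."""
    return Pipeline([
        ("scale", StandardScaler()),
        # Univariate filter: keep the features with the highest ANOVA F-scores.
        ("filter", SelectKBest(f_classif, k=n_filtered)),
        # Dimensionality reduction on the filtered features.
        ("pca", PCA(n_components=n_components, random_state=0)),
        # Multivariate selection: recursively drop the weakest inputs.
        ("rfe", RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


def nested_cv_accuracy(X, y, seed=0):
    """Inner loop tunes the RFE size; outer loop estimates accuracy."""
    inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed + 1)
    grid = {"rfe__n_features_to_select": [5, 10]}
    search = GridSearchCV(build_pipeline(), grid, cv=inner, scoring="accuracy")
    # Every selection step is refit inside each outer training fold,
    # so the held-out fold never influences which features are kept.
    return cross_val_score(search, X, y, cv=outer, scoring="accuracy")


# Synthetic stand-in for an expression matrix: few samples, many features.
X, y = make_classification(
    n_samples=120, n_features=500, n_informative=15, random_state=0
)
scores = nested_cv_accuracy(X, y)
print("outer-fold accuracies:", scores)
```

Keeping the filter, PCA and elimination steps inside one `Pipeline` means they are fitted only on training folds, which matters when the sample size is small relative to the number of genes.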