Discovery of novel biomarkers using machine learning based methods in cross disorder psychiatry cohorts
Main Author: | |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2023 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/166555 |
Institution: | Nanyang Technological University |
Summary: | Psychiatric disorders (PD) are gaining more attention because of their profound negative impact on individuals and society. Genomic psychiatry is consequently attracting more interest, as it holds much promise for PD biomarker discovery. However, genomic datasets collected in a psychiatric outpatient clinic setting usually consist of high-dimensional data with a small sample size, which poses a major challenge for accurate and significant clinical analysis of the transcriptomic data. In this project, we address this issue by proposing a pipeline of state-of-the-art machine learning based methods that extracts a salient set of genes (the features of the genomic data) as potential biomarkers for future biological analysis.
Using machine learning techniques, we aim to narrow down the genes to those potential biomarkers that have a significant impact in identifying bipolar disorder (BD). To better simulate a psychiatric outpatient clinic setting, we carried out the investigation on transcriptomic data from lithium-treated and non-lithium-treated bipolar patients (n=240) and healthy controls (n=240). After a series of data pre-processing steps, univariate filtering using an F-test was applied to the genomic data, followed by Principal Component Analysis (PCA) for dimensionality reduction. Lastly, we implemented multivariate feature selection by recursive feature elimination, using various machine learning models with nested cross-validation, to select the set of genes giving the best prediction accuracy in distinguishing BD patients from healthy controls. The results indicate that the genes selected by our proposed pipeline achieve higher predictive accuracy in classifying BD patients, and BD patients treated with lithium, from healthy controls.
We conclude that our proposed feature selection pipeline, combining univariate filtering, PCA and multivariate feature selection with machine learning based methods, is capable of overcoming the challenges posed by the high dimensionality of gene expression data and can select relevant salient features for further biological analysis. |
---|
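The selection steps named in the summary (F-test filtering, then recursive feature elimination scored with nested cross-validation) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the project's actual code: it uses scikit-learn, synthetic data in place of the transcriptomic cohort, logistic regression as the only model, and arbitrary values for `k` and the candidate feature counts. The summary does not say how the PCA output feeds the gene-level selection, so PCA is left out here and RFE is applied directly to the filtered features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for expression data (assumption): 480 samples in two
# roughly balanced classes, 200 "genes", of which 10 carry signal.
X, y = make_classification(
    n_samples=480, n_features=200, n_informative=10,
    n_redundant=0, random_state=0,
)

# Filter and RFE live inside the pipeline so they are refit on each
# training fold and never see the held-out samples.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("filter", SelectKBest(score_func=f_classif, k=50)),      # univariate F-test
    ("rfe", RFE(LogisticRegression(max_iter=1000), step=10)),  # multivariate RFE
])

# Inner loop: choose how many features RFE keeps (hypothetical grid).
param_grid = {"rfe__n_features_to_select": [5, 10, 20]}
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=inner_cv, scoring="accuracy")

# Outer loop: estimate accuracy of the whole selection procedure.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
print("nested CV accuracy per fold:", np.round(scores, 3))

# Refit on all data to read off the selected feature indices.
search.fit(X, y)
best = search.best_estimator_
kept_by_filter = np.flatnonzero(best.named_steps["filter"].get_support())
selected = kept_by_filter[best.named_steps["rfe"].support_]
print("selected feature indices:", selected)
```

Keeping the filter and the elimination step inside the cross-validated pipeline is the design choice that matters most with few samples and many features: selecting features on the full dataset before splitting would leak information from the test folds and inflate the accuracy estimate.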