Large scale transcriptomics analyses for gene function annotation and regulation

Bibliographic Details
Main Author: Tan, Qiao Wen
Other Authors: Marek Mutwil
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Online Access: https://hdl.handle.net/10356/170179
Institution: Nanyang Technological University
Description
Summary: Advances in methods for generating genome-wide gene expression data are reflected in the exponential growth of RNA-sequencing data deposited in sequence read archives over the past decade. While established methods such as forward and reverse genetics and protein structure determination remain the gold standard for validating gene function, they cannot be applied to every gene in every organism studied. Even in the best-studied model organisms, such as Arabidopsis, only 42.85% of protein-coding genes are experimentally validated. Co-expression is not a new method for predicting gene function in bioinformatics, but its power is proportional to the amount of data used in the analysis. The sheer amount and robustness of RNA-sequencing data enable co-expression analysis to be applied to more organisms at higher resolution. Despite the vast amount of data available, new data still need to be generated to provide context-specific gene expression data, especially for biological processes involving genes that have multiple functions or are differentially co-expressed. Moreover, the gap between data accumulation and the bioinformatic skill level of researchers remains to be closed. Although co-expression databases exist for this purpose, they may be outdated, limited to commonly studied organisms, and offer limited customisation of the dataset used to generate the co-expression network. Thus, tools that enable biologists who are not trained in computational biology to construct their own condition-dependent and condition-independent datasets and to perform co-expression analysis from raw RNA-sequencing data, without excessive hardware requirements, would be highly beneficial.
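As a rough illustration of what a co-expression analysis computes, the sketch below ranks genes by the Pearson correlation of their expression profiles with a chosen bait gene. It is a minimal example under stated assumptions, not the thesis pipeline: the gene names, the expression values, and the helper functions `pearson` and `coexpression_neighbours` are all hypothetical, and the abstract does not state which similarity measure the thesis uses.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / (sd_x * sd_y)

def coexpression_neighbours(expression, bait, top_n=2):
    """Return the top_n genes most positively correlated with the bait gene."""
    scores = [
        (gene, pearson(expression[bait], profile))
        for gene, profile in expression.items()
        if gene != bait
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_n]

# Toy expression matrix (hypothetical): one list of values per gene,
# one value per sample.
expression = {
    "bait_gene": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_a": [2.1, 3.9, 6.2, 8.0, 9.8],
    "gene_b": [5.0, 4.0, 3.0, 2.0, 1.0],
    "gene_c": [1.0, 3.0, 2.0, 5.0, 4.0],
}

neighbours = coexpression_neighbours(expression, "bait_gene")
for gene, score in neighbours:
    print(f"{gene}\t{score:.2f}")
```

With more samples per gene, the correlation estimates become more stable, which is one way to read the statement that the power of the method grows with the amount of data.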
The use of co-expression for gene function discovery from publicly available data is demonstrated in chapters 2 to 4 on three organisms: Plasmodium, a disease-causing parasite with many unique genes; Artemisia annua, a plant that synthesises an important secondary metabolite used to treat malaria, which is caused by the Plasmodium parasite; and Nicotiana tabacum, whose nicotine is used in tobacco products. Given the importance of Plasmodium and the lack of an existing co-expression database dedicated to it, Plasmodium data were downloaded and used to populate a co-expression database so that the wider community could benefit from the resulting co-expression network. Using the database, we show how it can identify genes that may be interesting for further characterisation based on their association with a biological function, their association with a gene module containing many characterised virulence genes, and their organelle specificity. In chapters 3 and 4, we demonstrate the pipelines that we designed to allow biologists with little to no training in computational biology to perform co-expression analyses. Through analyses of the biosynthesis pathways of the secondary metabolites artemisinin and nicotine, we highlight how the co-expression neighbourhoods of genes known to be involved in secondary metabolite biosynthesis can reveal other biosynthetic genes, potential transcriptional regulators, and components such as transporters involved in the process. The final chapter illustrates the importance of generating condition-specific data, despite the large amount of transcriptomic data already available, for situations such as the study of the plant stress response.
Through enrichment of biological processes, reconstruction of stress-specific gene regulatory networks, and comparison of stress-specific transcription factors in Marchantia, we observe three things: a hierarchy in the stress response in which certain stresses are more dominant; superior performance of stress-specific networks, indicating interactions that are masked when all experiments are aggregated; and disagreement between the involvement of orthologous transcription factors in Arabidopsis and Marchantia during stress. Finally, we investigated the predictability of gene expression during combined stress through three-dimensional linear regression of single-stress and combined-stress gene expression, and observed well-supported linear relationships in which the magnitude of the coefficients corresponded to the dominance of the stress.
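The regression idea in the last sentence can be sketched as follows. This is a minimal illustration under assumptions that the abstract does not confirm: the model form (combined ≈ a·stress1 + b·stress2, with no intercept), the use of per-gene expression changes as input, and the function name `fit_two_stress_model` are all hypothetical, and the numbers are made up. The sketch only shows how a larger fitted coefficient would indicate the more dominant stress.

```python
def fit_two_stress_model(stress1, stress2, combined):
    """Least-squares fit of combined ~ a * stress1 + b * stress2 (no intercept).

    Each argument is a list with one expression change per gene.
    Solves the 2x2 normal equations directly.
    """
    s11 = sum(x * x for x in stress1)
    s22 = sum(x * x for x in stress2)
    s12 = sum(x * y for x, y in zip(stress1, stress2))
    s1y = sum(x * y for x, y in zip(stress1, combined))
    s2y = sum(x * y for x, y in zip(stress2, combined))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("single-stress profiles are collinear; fit is not unique")
    a = (s1y * s22 - s12 * s2y) / det
    b = (s11 * s2y - s12 * s1y) / det
    return a, b

# Hypothetical per-gene expression changes under two single stresses.
stress1 = [1.0, -2.0, 0.5, 3.0, -1.0]
stress2 = [0.5, 1.0, -1.0, 2.0, 0.0]
# Toy combined-stress response constructed so that stress 1 dominates.
combined = [0.8 * x + 0.2 * y for x, y in zip(stress1, stress2)]

a, b = fit_two_stress_model(stress1, stress2, combined)
dominant = "stress 1" if abs(a) > abs(b) else "stress 2"
print(f"a = {a:.2f}, b = {b:.2f}, dominant: {dominant}")
```

Because the toy combined response is built from known weights, the fit recovers them, and the larger coefficient identifies the dominant stress; with real data the fit would be approximate and its quality would need to be assessed separately.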