Large scale transcriptomics analyses for gene function annotation and regulation
Main Author: | Tan, Qiao Wen |
---|---|
Other Authors: | Marek Mutwil |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2023 |
Subjects: | Science::Biological sciences::Molecular biology |
Online Access: | https://hdl.handle.net/10356/170179 |
Institution: | Nanyang Technological University |
id | sg-ntu-dr.10356-170179 |
---|---|
record_format | dspace |
institution | Nanyang Technological University |
building | NTU Library |
continent | Asia |
country | Singapore |
content_provider | NTU Library |
collection | DR-NTU |
language | English |
topic | Science::Biological sciences::Molecular biology |
spellingShingle | Science::Biological sciences::Molecular biology Tan, Qiao Wen Large scale transcriptomics analyses for gene function annotation and regulation |
description |
Advances in methods for generating genome-wide gene expression data are reflected in the exponential growth of RNA-sequencing data deposited in sequence read archives over the past decade. While established methods such as forward and reverse genetics and protein structure determination remain the gold standard for validating gene function, it is not feasible to apply them to every gene in every organism studied. Even in Arabidopsis, one of the most well-studied model organisms, only 42.85% of protein-coding genes have been experimentally validated.
Co-expression is not a new method for predicting gene function in bioinformatics, but its power is proportional to the amount of data used in the analysis. The sheer volume and robustness of RNA-sequencing data enable us to apply co-expression analysis to more organisms and at higher resolution. Despite the vast amount of data available, new data still need to be generated to provide context-specific gene expression data, especially for biological processes that involve genes that have multiple functions or are differentially co-expressed. However, the gap between data accumulation and the bioinformatic skill level of researchers has yet to be closed. Although co-expression databases exist for this purpose, they may be outdated, limited to commonly studied organisms, and offer limited customisation of the dataset used to generate the co-expression network. Thus, tools that enable biologists who are not trained in computational biology to construct their own condition-dependent and condition-independent datasets and perform co-expression analysis from raw RNA-sequencing data, without excessive hardware requirements, would be highly beneficial.
The use of co-expression for gene function discovery with publicly available data is demonstrated in chapters 2 to 4 on three organisms: Plasmodium, a disease-causing parasite with many unique genes; Artemisia annua, a plant that synthesises an important secondary metabolite used to treat malaria, which is caused by the Plasmodium parasite; and Nicotiana tabacum, whose nicotine is used in tobacco products. Owing to the importance of Plasmodium and the lack of an existing co-expression database dedicated to it, the data for the organisms were downloaded and used to populate a co-expression database so that the wider community can benefit from the resulting co-expression network. We show how the database can be used to identify genes that merit further characterisation based on their association with a biological function, their association with a gene module containing many characterised virulence genes, and their organelle specificity. In chapters 3 and 4, we demonstrate pipelines that we designed to let biologists with little to no training in computational biology perform co-expression analyses. Through analyses of the artemisinin and nicotine secondary metabolite biosynthesis pathways, we highlight how the co-expression neighbourhoods of genes known to be involved in secondary metabolite biosynthesis can reveal other biosynthetic genes, potential transcriptional regulators, and components such as transporters involved in the process.
The final chapter illustrates the importance of generating condition-specific data, despite the large amount of transcriptomic data available, for situations such as the study of the plant stress response. Through enrichment of biological processes, reconstruction of stress-specific gene regulatory networks, and comparison of stress-specific transcription factors in Marchantia, we observe, respectively, a hierarchy in the stress response in which certain stresses are more dominant; superior performance of stress-specific networks, indicative of interactions that are masked when all experiments are aggregated; and disagreement between Arabidopsis and Marchantia in the involvement of transcription factor orthologs during stress.
Finally, we investigated the predictability of gene expression during combined stress through three-dimensional linear regression of single-stress and combined-stress gene expression, and observed well-supported linear relationships in which the magnitude of the coefficients corresponded to the dominance of the stress. |
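
The regression mentioned at the end of the abstract can be illustrated with a minimal sketch. It assumes one plausible reading of "three-dimensional linear regression": the combined-stress response of each gene is modelled as a plane over its two single-stress responses. The function name `fit_plane`, the variable names, and the synthetic data are hypothetical and are not taken from the thesis; the record does not state which stresses, units, or software were used.

```python
import random


def fit_plane(stress_a, stress_b, combined):
    """Ordinary least squares for combined ~ coef_a*stress_a + coef_b*stress_b + intercept.

    Each list holds one value per gene (for example a log fold change; assumed here).
    Returns (coef_a, coef_b, intercept, r_squared).
    """
    n = len(combined)
    mean_a = sum(stress_a) / n
    mean_b = sum(stress_b) / n
    mean_y = sum(combined) / n
    # Centred sums of squares and cross-products.
    s_aa = sum((a - mean_a) ** 2 for a in stress_a)
    s_bb = sum((b - mean_b) ** 2 for b in stress_b)
    s_ab = sum((a - mean_a) * (b - mean_b) for a, b in zip(stress_a, stress_b))
    s_ay = sum((a - mean_a) * (y - mean_y) for a, y in zip(stress_a, combined))
    s_by = sum((b - mean_b) * (y - mean_y) for b, y in zip(stress_b, combined))
    s_yy = sum((y - mean_y) ** 2 for y in combined)
    det = s_aa * s_bb - s_ab ** 2
    if det == 0 or s_yy == 0:
        raise ValueError("inputs are collinear or the response is constant")
    # Solve the 2x2 normal equations with Cramer's rule.
    coef_a = (s_ay * s_bb - s_by * s_ab) / det
    coef_b = (s_by * s_aa - s_ay * s_ab) / det
    intercept = mean_y - coef_a * mean_a - coef_b * mean_b
    # Share of the variance in the combined response explained by the plane.
    r_squared = (coef_a * s_ay + coef_b * s_by) / s_yy
    return coef_a, coef_b, intercept, r_squared


# Synthetic illustration only: stress A is made the dominant stress by construction.
rng = random.Random(0)
stress_a = [rng.gauss(0, 1) for _ in range(500)]
stress_b = [rng.gauss(0, 1) for _ in range(500)]
combined = [0.8 * a + 0.2 * b + rng.gauss(0, 0.05) for a, b in zip(stress_a, stress_b)]

coef_a, coef_b, intercept, r_squared = fit_plane(stress_a, stress_b, combined)
print(f"coef_a={coef_a:.2f} coef_b={coef_b:.2f} R^2={r_squared:.3f}")
```

Under this reading, the stress with the larger fitted coefficient contributes more to the combined response, which is how the abstract relates coefficient magnitude to stress dominance.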
author2 | Marek Mutwil |
author_facet | Marek Mutwil Tan, Qiao Wen |
format | Thesis-Doctor of Philosophy |
author | Tan, Qiao Wen |
author_sort | Tan, Qiao Wen |
title | Large scale transcriptomics analyses for gene function annotation and regulation |
title_short | Large scale transcriptomics analyses for gene function annotation and regulation |
title_full | Large scale transcriptomics analyses for gene function annotation and regulation |
title_fullStr | Large scale transcriptomics analyses for gene function annotation and regulation |
title_full_unstemmed | Large scale transcriptomics analyses for gene function annotation and regulation |
title_sort | large scale transcriptomics analyses for gene function annotation and regulation |
publisher | Nanyang Technological University |
publishDate | 2023 |
url | https://hdl.handle.net/10356/170179 |
_version_ | 1779156327505854464 |
spelling | sg-ntu-dr.10356-170179 2023-09-04T15:33:15Z Large scale transcriptomics analyses for gene function annotation and regulation Tan, Qiao Wen Marek Mutwil School of Biological Sciences mutwil@ntu.edu.sg Science::Biological sciences::Molecular biology Doctor of Philosophy 2023-08-31T00:48:45Z 2023-08-31T00:48:45Z 2023 Thesis-Doctor of Philosophy Tan, Q. W. (2023). Large scale transcriptomics analyses for gene function annotation and regulation. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/170179 10.32657/10356/170179 en 04INS000396C220 This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University |
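
The co-expression neighbourhood idea described in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the Pearson correlation coefficient as the co-expression measure, which the record does not confirm as the metric used in the thesis, and the gene names, expression values, and the helper `coexpression_neighbours` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpression_neighbours(expression, bait, top_n=3):
    """Rank all other genes by their correlation with the bait gene."""
    scores = [
        (gene, pearson(expression[bait], profile))
        for gene, profile in expression.items()
        if gene != bait
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_n]


# Hypothetical expression matrix: one profile per gene, one value per sample.
expression = {
    "bait_enzyme": [1.0, 2.0, 3.0, 4.0, 5.0],
    "candidate_enzyme": [2.1, 3.9, 6.2, 8.0, 9.9],
    "candidate_tf": [0.5, 1.1, 1.4, 2.2, 2.4],
    "unrelated_gene": [5.0, 1.0, 4.0, 2.0, 3.0],
}

neighbours = coexpression_neighbours(expression, "bait_enzyme", top_n=2)
for gene, score in neighbours:
    print(f"{gene}\t{score:.3f}")
```

Genes that rank highly against a known biosynthetic gene are candidates for further characterisation; a real analysis would start from many more samples and a normalised expression matrix.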