About the dark corners in the gene function space of Escherichia coli remaining without illumination by scientific literature
Background: Although Escherichia coli (E. coli) is the most studied prokaryote organism in the history of life sciences, many molecular mechanisms and gene functions encoded in its genome remain to be discovered. This work aims at quantifying the illumination of the E. coli gene function space by t...
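Main Authors: | Tantoso, Erwin; Eisenhaber, Birgit; Sinha, Swati; Jensen, Lars Juhl; Eisenhaber, Frank |
---|---|
Other Authors: | School of Biological Sciences |
Format: | Article |
Language: | English |
Published: | 2023 |
Subjects: | Science::Biological sciences; Gene Function Space; Uncharacterized Genes |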
Online Access: | https://hdl.handle.net/10356/169685 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-169685 |
---|---|
record_format |
dspace |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Science::Biological sciences Gene Function Space Uncharacterized Genes |
description |
Background:
Although Escherichia coli (E. coli) is the most studied prokaryotic organism in the history of life sciences, many molecular mechanisms and gene functions encoded in its genome remain to be discovered. This work aims at quantifying the illumination of the E. coli gene function space by the scientific literature and how close we are to the goal of a complete list of E. coli gene functions.
Results:
The scientific literature about E. coli protein-coding genes was mapped onto the genome via mentions of names for genomic regions in scientific articles, both for the strain K-12 MG1655 and for the 95%-threshold softcore genome of 1324 E. coli strains with known complete genomes. The article match was quantified as the ratio of a given gene name’s occurrences to the mentions of any gene names in the paper. Literature coverage of the various genome regions is extremely uneven. A group of elite genes with ≥ 100 full publication equivalents (FPEs; FPE = 1 is an idealized publication devoted to just a single gene) attracts the lion's share of the papers. For K-12, ~ 65% of the literature covers just 342 elite genes; for the softcore genome, ~ 68% of the FPEs concern only 342 elite gene families (GFs). Most genes/GFs have at least one mention in a dedicated scientific article (with the exception of at least 137 protein-coding transcripts for K-12 and 26 GFs from the softcore genome). Until 2005–2010, literature growth rates were highest for uncharacterized or understudied genes compared with other groups of genes; they became negative thereafter. At the same time, literature for genes that were already reasonably well studied, at threshold T10 (≥ 10 FPEs), started to grow explosively. Typically, a body of ~ 20 actual articles generated over ~ 15 years of research effort was necessary to reach T10. Lineage-specific co-occurrence analysis of genes belonging to the accessory genome of E. coli, together with genomic co-localization and sequence-analytic exploration, hints that the previously completely uncharacterized genes yahV and yddL are associated with osmotic stress response/motility mechanisms.
Conclusion:
If the number of scientific articles about uncharacterized and understudied genes remains at least at present levels, full gene function lists for the strain K-12 MG1655 and the E. coli softcore genome are within reach in the next 25–30 years. Once the literature body for a gene crosses 10 FPEs, most of the critical fundamental research risk appears to be overcome and steady incremental research becomes possible. |
author2 |
School of Biological Sciences |
author_facet |
School of Biological Sciences Tantoso, Erwin Eisenhaber, Birgit Sinha, Swati Jensen, Lars Juhl Eisenhaber, Frank |
format |
Article |
author |
Tantoso, Erwin Eisenhaber, Birgit Sinha, Swati Jensen, Lars Juhl Eisenhaber, Frank |
author_sort |
Tantoso, Erwin |
title |
About the dark corners in the gene function space of Escherichia coli remaining without illumination by scientific literature |
publishDate |
2023 |
url |
https://hdl.handle.net/10356/169685 |
_version_ |
1773551208837939200 |
spelling |
sg-ntu-dr.10356-169685
2023-07-31T15:32:17Z
Tantoso, Erwin; Eisenhaber, Birgit; Sinha, Swati; Jensen, Lars Juhl; Eisenhaber, Frank
School of Biological Sciences; Genome Institute of Singapore, A*STAR; Bioinformatics Institute, A*STAR
Science::Biological sciences; Gene Function Space; Uncharacterized Genes
Published version
2023-07-31T04:23:05Z 2023-07-31T04:23:05Z 2023
Journal Article
Tantoso, E., Eisenhaber, B., Sinha, S., Jensen, L. J. & Eisenhaber, F. (2023). About the dark corners in the gene function space of Escherichia coli remaining without illumination by scientific literature. Biology Direct, 18(1), 7-. https://dx.doi.org/10.1186/s13062-023-00362-0
1745-6150
https://hdl.handle.net/10356/169685
10.1186/s13062-023-00362-0
36855185
2-s2.0-85149153019
1 18 7
en
Biology Direct
© The Author(s) 2023. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
application/pdf |
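
The abstract describes the article match as the ratio of a given gene name's occurrences to the mentions of any gene names in a paper, with FPE = 1 corresponding to a publication devoted to a single gene. A minimal sketch of how such a score could be accumulated is given below. It assumes, beyond what the abstract states, that per-paper ratios are summed across papers to obtain a gene's FPE total and that every occurrence of a name counts once; the function names `fpe_per_gene` and `tier` and the toy mention lists are hypothetical and are not taken from the article.

```python
from collections import Counter, defaultdict


def fpe_per_gene(papers):
    """Sum per-paper mention ratios into one FPE-like score per gene.

    `papers` is an iterable of lists; each list holds the gene names
    mentioned in one paper, repeated once per occurrence.
    Assumption: a gene's total is the sum of its per-paper ratios.
    """
    totals = defaultdict(float)
    for mentions in papers:
        if not mentions:
            continue  # a paper without gene names contributes nothing
        counts = Counter(mentions)
        n_mentions = sum(counts.values())
        for gene, count in counts.items():
            # Ratio of this gene's occurrences to all gene-name mentions.
            totals[gene] += count / n_mentions
    return dict(totals)


def tier(fpe):
    """Label a score using the thresholds named in the abstract."""
    if fpe >= 100:
        return "elite"      # >= 100 FPEs
    if fpe >= 10:
        return "T10"        # >= 10 FPEs
    return "below T10"


# Toy data (hypothetical): three papers and the gene names they mention.
papers = [
    ["lacZ", "lacZ", "lacY", "lacZ"],  # lacZ gets 3/4, lacY gets 1/4
    ["lacZ"],                          # devoted to one gene: contributes 1
    ["yahV", "yddL"],                  # each gene gets 1/2
]
scores = fpe_per_gene(papers)
```

Under these assumptions, each paper that mentions at least one gene distributes exactly one unit across the genes it names, so the scores over all genes sum to the number of such papers.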