About the dark corners in the gene function space of Escherichia coli remaining without illumination by scientific literature

Background: Although Escherichia coli (E. coli) is the most studied prokaryote organism in the history of life sciences, many molecular mechanisms and gene functions encoded in its genome remain to be discovered. This work aims at quantifying the illumination of the E. coli gene function space by t...

Bibliographic Details
Main Authors: Tantoso, Erwin, Eisenhaber, Birgit, Sinha, Swati, Jensen, Lars Juhl, Eisenhaber, Frank
Other Authors: School of Biological Sciences
Format: Article
Language: English
Published: 2023
Subjects: Science::Biological sciences; Gene Function Space; Uncharacterized Genes
Online Access: https://hdl.handle.net/10356/169685
Institution: Nanyang Technological University
id sg-ntu-dr.10356-169685
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Science::Biological sciences
Gene Function Space
Uncharacterized Genes
description Background: Although Escherichia coli (E. coli) is the most studied prokaryotic organism in the history of the life sciences, many molecular mechanisms and gene functions encoded in its genome remain to be discovered. This work aims to quantify the illumination of the E. coli gene function space by the scientific literature and to assess how close we are to the goal of a complete list of E. coli gene functions. Results: The scientific literature about E. coli protein-coding genes has been mapped onto the genome via the mentions of names for genomic regions in scientific articles, both for the strain K-12 MG1655 and for the 95%-threshold softcore genome of 1324 E. coli strains with known complete genomes. The article match was quantified as the ratio of a given gene name’s occurrences to the mentions of any gene names in the paper. The various genome regions have extremely uneven literature coverage. A group of elite genes with ≥ 100 full publication equivalents (FPEs; FPE = 1 is an idealized publication devoted to just a single gene) attracts the lion’s share of the papers. For K-12, ~ 65% of the literature covers just 342 elite genes; for the softcore genome, ~ 68% of the FPEs are about only 342 elite gene families (GFs). We also find that most genes/GFs have at least one mention in a dedicated scientific article (with the exception of at least 137 protein-coding transcripts for K-12 and 26 GFs from the softcore genome). Whereas the literature growth rates were highest for uncharacterized or understudied genes until 2005–2010 compared with other groups of genes, they became negative thereafter. At the same time, the literature for genes that were already well studied (threshold T10, ≥ 10 FPEs) started to grow explosively. Typically, a body of ~ 20 actual articles generated over ~ 15 years of research effort was necessary to reach T10. Lineage-specific co-occurrence analysis of genes belonging to the accessory genome of E. coli, together with genomic co-localization and sequence-analytic exploration, hints at the previously completely uncharacterized genes yahV and yddL being associated with osmotic stress response/motility mechanisms. Conclusion: If the numbers of scientific articles about uncharacterized and understudied genes remain at least at present levels, full gene function lists for the strain K-12 MG1655 and the E. coli softcore genome are within reach in the next 25–30 years. Once the literature body for a gene crosses 10 FPEs, most of the critical fundamental research risk appears to be overcome and steady incremental research becomes possible.
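The description above defines the article match as the ratio of one gene name's occurrences to all gene-name mentions in a paper, and measures coverage in full publication equivalents (FPEs). The following is a minimal sketch of that bookkeeping, not the authors' pipeline. It assumes that a gene's FPE total is the sum of its per-paper ratios, which is consistent with "FPE = 1 is an idealized publication devoted to just a single gene" but is not spelled out in the abstract. The function names `fpe_per_gene` and `tier`, the tier labels, and the toy gene mentions are hypothetical.

```python
from collections import Counter, defaultdict


def fpe_per_gene(papers):
    """Sum per-paper mention ratios into FPE totals per gene.

    papers: iterable of lists, each list holding the gene-name mentions
    found in one paper (repeats allowed).
    Assumption: FPE(gene) = sum over papers of
    (mentions of gene in paper) / (mentions of any gene in paper).
    """
    totals = defaultdict(float)
    for mentions in papers:
        counts = Counter(mentions)
        n_mentions = sum(counts.values())
        if n_mentions == 0:
            continue  # a paper without gene names contributes nothing
        for gene, count in counts.items():
            totals[gene] += count / n_mentions
    return dict(totals)


def tier(fpe):
    """Group a gene by the two thresholds named in the description."""
    if fpe >= 100:
        return "elite"      # >= 100 FPEs
    if fpe >= 10:
        return "T10"        # >= 10 FPEs
    return "below T10"


# Toy input: three papers, one of them without any gene mention.
toy_papers = [["lacZ", "lacZ", "recA"], ["lacZ"], []]
toy_fpe = fpe_per_gene(toy_papers)
```

Under this assumption every paper with at least one gene mention contributes exactly one FPE in total, so the FPE sum over all genes equals the number of such papers.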
author2 School of Biological Sciences
format Article
author Tantoso, Erwin
Eisenhaber, Birgit
Sinha, Swati
Jensen, Lars Juhl
Eisenhaber, Frank
author_sort Tantoso, Erwin
title About the dark corners in the gene function space of Escherichia coli remaining without illumination by scientific literature
publishDate 2023
url https://hdl.handle.net/10356/169685
spelling sg-ntu-dr.10356-169685 2023-07-31T15:32:17Z School of Biological Sciences; Genome Institute of Singapore, A*STAR; Bioinformatics Institute, A*STAR. Published version 2023-07-31T04:23:05Z. Journal Article. Tantoso, E., Eisenhaber, B., Sinha, S., Jensen, L. J. & Eisenhaber, F. (2023). About the dark corners in the gene function space of Escherichia coli remaining without illumination by scientific literature. Biology Direct, 18(1), 7. https://dx.doi.org/10.1186/s13062-023-00362-0 ISSN 1745-6150; https://hdl.handle.net/10356/169685; DOI 10.1186/s13062-023-00362-0; PMID 36855185; Scopus 2-s2.0-85149153019; en; application/pdf
© The Author(s) 2023. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.