A comparative analysis of ENCODE and Cistrome in the context of TF binding signal

Background: With the rise of publicly available genomic data repositories, it is now common for scientists to rely on computational models and preprocessed data, either as control or to discover new knowledge. However, different repositories adhere to different principles and guidelines, and dat...

Bibliographic Details
Main Authors: Perna, Stefano, Pinoli, Pietro, Ceri, Stefano, Wong, Limsoon
Other Authors: Lee Kong Chian School of Medicine (LKCMedicine)
Format: Article
Language: English
Published: 2024
Subjects: Medicine, Health and Life Sciences; Transcription Factors; SignalValue
Online Access: https://hdl.handle.net/10356/180519
Institution: Nanyang Technological University
id sg-ntu-dr.10356-180519
record_format dspace
spelling sg-ntu-dr.10356-180519 2024-10-13T15:37:45Z
Ministry of Education (MOE)
National Research Foundation (NRF)
Published version
This work was supported by National Research Foundation, Singapore, under its Synthetic Biology Research and Development Programme (Award No: SBPP3); and by Ministry of Education, Singapore, Academic Research Fund Tier-1 (Award No: MOE T1 251RES1725). SC and PP are supported by the ERC AdG 693174 “Data-driven Genomic Computing (GeCo)”.
2024-10-10T01:26:40Z 2024-10-10T01:26:40Z
2024 Journal Article
Perna, S., Pinoli, P., Ceri, S. & Wong, L. (2024). A comparative analysis of ENCODE and Cistrome in the context of TF binding signal. BMC Genomics, 25(Suppl 3), 817-. https://dx.doi.org/10.1186/s12864-024-10668-6
1471-2164
https://hdl.handle.net/10356/180519
10.1186/s12864-024-10668-6
2-s2.0-85202898256
Suppl 3 25 817 en SBPP3 MOE T1 251RES1725
BMC Genomics
© 2024 The Author(s). Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Medicine, Health and Life Sciences
Transcription Factors
SignalValue
description Background: With the rise of publicly available genomic data repositories, it is now common for scientists to rely on computational models and preprocessed data, either as control or to discover new knowledge. However, different repositories adhere to different principles and guidelines, and data processing plays a significant role in the quality of the resulting datasets. Two popular repositories for transcription factor binding site data, ENCODE and Cistrome, process the same biological samples in alternative ways, and their results are not always consistent. Moreover, the output format of the processing (BED narrowPeak) exposes a feature, the signalValue, which is seldom used in consistency checks but can offer valuable insight into the quality of the data. Results: We provide evidence that data points with high signalValue(s) (top 25% of values) are more likely to be consistent between ENCODE and Cistrome in human cell lines K562, GM12878, and HepG2. In addition, we show that filtering according to said high values improves the quality of predictions for a machine learning algorithm that detects transcription factor interactions based only on positional information. Finally, we provide a set of practices and guidelines, based on the signalValue feature, for scientists who wish to compare and merge narrowPeaks from ENCODE and Cistrome. Conclusions: The signalValue is an informative feature that can be effectively used to highlight consistent areas of overlap between different sources of TF binding sites that expose it. Its applicability extends downstream to positional machine learning algorithms, making it a powerful tool for performance tweaking and data aggregation.
author2 Lee Kong Chian School of Medicine (LKCMedicine)
format Article
author Perna, Stefano
Pinoli, Pietro
Ceri, Stefano
Wong, Limsoon
author_sort Perna, Stefano
title A comparative analysis of ENCODE and Cistrome in the context of TF binding signal
title_sort comparative analysis of encode and cistrome in the context of tf binding signal
publishDate 2024
url https://hdl.handle.net/10356/180519
_version_ 1814777805291913216
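
The description above mentions keeping only narrowPeak data points whose signalValue falls in the top 25% of values and then checking consistency between two sources. The following is a minimal sketch of that idea, not the authors' pipeline. It assumes the standard narrowPeak column layout (signalValue is the 7th tab-separated column), computes the cutoff per file, and treats two peaks as consistent if they overlap by at least one base on the same chromosome; all three choices are assumptions, and every function name here is hypothetical.

```python
import statistics
from typing import Iterable, List, NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int           # 0-based, inclusive (BED convention)
    end: int             # exclusive (BED convention)
    signal_value: float  # narrowPeak column 7


def parse_narrowpeak(lines: Iterable[str]) -> List[Peak]:
    """Parse narrowPeak text lines, skipping blanks and header lines."""
    peaks = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        peaks.append(Peak(fields[0], int(fields[1]), int(fields[2]), float(fields[6])))
    return peaks


def top_quartile(peaks: List[Peak]) -> List[Peak]:
    """Keep peaks at or above the 75th percentile of signalValue (assumed cutoff rule)."""
    if len(peaks) < 2:
        return list(peaks)
    values = [p.signal_value for p in peaks]
    q3 = statistics.quantiles(values, n=4, method="inclusive")[2]
    return [p for p in peaks if p.signal_value >= q3]


def overlaps(a: Peak, b: Peak) -> bool:
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def fraction_confirmed(query: List[Peak], reference: List[Peak]) -> float:
    """Fraction of query peaks that overlap at least one reference peak.

    Quadratic scan for clarity; real peak files would call for an
    interval index or sorted sweep instead.
    """
    if not query:
        return 0.0
    hits = sum(any(overlaps(q, r) for r in reference) for q in query)
    return hits / len(query)
```

A typical use would be to parse one ENCODE file and one Cistrome file for the same sample, then compare `fraction_confirmed(top_quartile(a), b)` against `fraction_confirmed(a, b)` to see whether the high-signal subset is confirmed more often by the other source.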