A comparative analysis of ENCODE and Cistrome in the context of TF binding signal
Background: With the rise of publicly available genomic data repositories, it is now common for scientists to rely on computational models and preprocessed data, either as controls or to discover new knowledge. However, different repositories adhere to different principles and guidelines, and dat...
Main Authors: | Perna, Stefano; Pinoli, Pietro; Ceri, Stefano; Wong, Limsoon |
---|---|
Other Authors: | Lee Kong Chian School of Medicine (LKCMedicine) |
Format: | Article |
Language: | English |
Published: | 2024 |
Subjects: | Medicine, Health and Life Sciences; Transcription Factors; SignalValue |
Online Access: | https://hdl.handle.net/10356/180519 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-180519 |
---|---|
record_format |
dspace |
spelling |
sg-ntu-dr.10356-180519 2024-10-13T15:37:45Z A comparative analysis of ENCODE and Cistrome in the context of TF binding signal Perna, Stefano Pinoli, Pietro Ceri, Stefano Wong, Limsoon Lee Kong Chian School of Medicine (LKCMedicine) Medicine, Health and Life Sciences Transcription Factors SignalValue Background: With the rise of publicly available genomic data repositories, it is now common for scientists to rely on computational models and preprocessed data, either as controls or to discover new knowledge. However, different repositories adhere to different principles and guidelines, and data processing plays a significant role in the quality of the resulting datasets. Two popular repositories for transcription factor binding site data - ENCODE and Cistrome - process the same biological samples in alternative ways, and their results are not always consistent. Moreover, the output format of the processing (BED narrowPeak) exposes a feature, the signalValue, which is seldom used in consistency checks, but can offer valuable insight into the quality of the data. Results: We provide evidence that data points with high signalValue(s) (top 25% of values) are more likely to be consistent between ENCODE and Cistrome in human cell lines K562, GM12878, and HepG2. In addition, we show that filtering according to said high values improves the quality of predictions for a machine learning algorithm that detects transcription factor interactions based only on positional information. Finally, we provide a set of practices and guidelines, based on the signalValue feature, for scientists who wish to compare and merge narrowPeaks from ENCODE and Cistrome. Conclusions: The signalValue feature is an informative feature that can be effectively used to highlight consistent areas of overlap between different sources of TF binding sites that expose it. 
Its applicability extends downstream to positional machine learning algorithms, making it a powerful tool for performance tweaking and data aggregation. Ministry of Education (MOE) National Research Foundation (NRF) Published version This work was supported by National Research Foundation, Singapore, under its Synthetic Biology Research and Development Programme (Award No: SBPP3); and by Ministry of Education, Singapore, Academic Research Fund Tier-1 (Award No: MOE T1 251RES1725). SC and PP are supported by the ERC AdG 693174 “Data-driven Genomic Computing (GeCo)”. 2024-10-10T01:26:40Z 2024-10-10T01:26:40Z 2024 Journal Article Perna, S., Pinoli, P., Ceri, S. & Wong, L. (2024). A comparative analysis of ENCODE and Cistrome in the context of TF binding signal. BMC Genomics, 25(Suppl 3), 817-. https://dx.doi.org/10.1186/s12864-024-10668-6 1471-2164 https://hdl.handle.net/10356/180519 10.1186/s12864-024-10668-6 2-s2.0-85202898256 Suppl 3 25 817 en SBPP3 MOE T1 251RES1725 BMC Genomics © 2024 The Author(s). Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. 
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. application/pdf |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Medicine, Health and Life Sciences; Transcription Factors; SignalValue |
spellingShingle |
Medicine, Health and Life Sciences Transcription Factors SignalValue Perna, Stefano Pinoli, Pietro Ceri, Stefano Wong, Limsoon A comparative analysis of ENCODE and Cistrome in the context of TF binding signal |
description |
Background: With the rise of publicly available genomic data repositories, it is now common for scientists to rely on computational models and preprocessed data, either as controls or to discover new knowledge. However, different repositories adhere to different principles and guidelines, and data processing plays a significant role in the quality of the resulting datasets. Two popular repositories for transcription factor binding site data - ENCODE and Cistrome - process the same biological samples in alternative ways, and their results are not always consistent. Moreover, the output format of the processing (BED narrowPeak) exposes a feature, the signalValue, which is seldom used in consistency checks, but can offer valuable insight into the quality of the data. Results: We provide evidence that data points with high signalValue(s) (top 25% of values) are more likely to be consistent between ENCODE and Cistrome in human cell lines K562, GM12878, and HepG2. In addition, we show that filtering according to said high values improves the quality of predictions for a machine learning algorithm that detects transcription factor interactions based only on positional information. Finally, we provide a set of practices and guidelines, based on the signalValue feature, for scientists who wish to compare and merge narrowPeaks from ENCODE and Cistrome. Conclusions: The signalValue feature is an informative feature that can be effectively used to highlight consistent areas of overlap between different sources of TF binding sites that expose it. Its applicability extends downstream to positional machine learning algorithms, making it a powerful tool for performance tweaking and data aggregation. |
author2 |
Lee Kong Chian School of Medicine (LKCMedicine) |
author_facet |
Lee Kong Chian School of Medicine (LKCMedicine); Perna, Stefano; Pinoli, Pietro; Ceri, Stefano; Wong, Limsoon |
format |
Article |
author |
Perna, Stefano; Pinoli, Pietro; Ceri, Stefano; Wong, Limsoon |
author_sort |
Perna, Stefano |
title |
A comparative analysis of ENCODE and Cistrome in the context of TF binding signal |
title_sort |
comparative analysis of encode and cistrome in the context of tf binding signal |
publishDate |
2024 |
url |
https://hdl.handle.net/10356/180519 |
_version_ |
1814777805291913216 |
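
The abstract in this record describes keeping narrowPeak entries whose signalValue falls in the top 25% of values. The record does not give the authors' exact procedure, so the following is only a minimal sketch under stated assumptions: the quartile is computed over a single file's peaks, ties at the cutoff are kept, and the names `Peak`, `parse_narrowpeak`, and `top_fraction_by_signal` are hypothetical helpers introduced here for illustration. The only format detail relied on is that narrowPeak is a tab-separated BED6+4 layout with signalValue in the seventh column.

```python
import math
from typing import Iterable, List, NamedTuple


class Peak(NamedTuple):
    """Hypothetical container for the narrowPeak fields used in this sketch."""
    chrom: str
    start: int
    end: int
    signal_value: float


def parse_narrowpeak(lines: Iterable[str]) -> List[Peak]:
    """Parse tab-separated narrowPeak lines (BED6+4; signalValue is column 7)."""
    peaks = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 10:
            raise ValueError(f"expected 10 narrowPeak columns, got {len(fields)}")
        peaks.append(Peak(fields[0], int(fields[1]), int(fields[2]), float(fields[6])))
    return peaks


def top_fraction_by_signal(peaks: List[Peak], fraction: float = 0.25) -> List[Peak]:
    """Keep peaks whose signalValue is within the top `fraction` of values.

    Assumption: the cutoff is rank-based within this one list, and peaks tied
    with the cutoff value are retained. Input order is preserved.
    """
    if not peaks:
        return []
    keep = max(1, math.ceil(len(peaks) * fraction))
    ranked = sorted(peaks, key=lambda p: p.signal_value, reverse=True)
    cutoff = ranked[keep - 1].signal_value
    return [p for p in peaks if p.signal_value >= cutoff]


# Illustrative, made-up input lines (not data from the article).
example_lines = [
    "chr1\t100\t200\tpeak_a\t0\t.\t3.5\t-1\t-1\t50",
    "chr1\t500\t650\tpeak_b\t0\t.\t12.0\t-1\t-1\t70",
    "chr2\t900\t990\tpeak_c\t0\t.\t7.25\t-1\t-1\t40",
    "chr2\t1500\t1600\tpeak_d\t0\t.\t1.0\t-1\t-1\t30",
]
for peak in top_fraction_by_signal(parse_narrowpeak(example_lines)):
    print(peak.chrom, peak.start, peak.end, peak.signal_value)
```

A rank-based cutoff is used here instead of an interpolated percentile so that the behaviour on small files is easy to reason about; whether the published analysis thresholds per file, per transcription factor, or per cell line would need to be checked against the article itself.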