Digital karyotyping of cancer cell lines from NGS data
Human immortalized and transformed cell lines are the workhorse of cancer research. They are widely used to study cancer biology and to test and screen anti-cancer compounds in order to improve the efficacy of cancer treatment. Detection of DNA copy number alterations (CNAs) is critical to understan...
Main Author: | Ahmed Ibrahim Samir Khalil |
---|---|
Other Authors: | Anupam Chattopadhyay |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2020 |
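Subjects: | Engineering::Computer science and engineering::Mathematics of computing::Mathematical software; Engineering::Computer science and engineering::Mathematics of computing::Probability and statistics |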
Online Access: | https://hdl.handle.net/10356/138176 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-138176 |
---|---|
record_format |
dspace |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering::Mathematics of computing::Mathematical software; Engineering::Computer science and engineering::Mathematics of computing::Probability and statistics |
description |
Human immortalized and transformed cell lines are the workhorses of cancer research. They are widely used to study cancer biology and to test and screen anti-cancer compounds in order to improve the efficacy of cancer treatment. Detection of DNA copy number alterations (CNAs) is critical to understanding the genetic diversity, genome evolution, and pathological conditions of cancer cells. Several computational tools have been developed to identify CNAs using the read depth (RD) from single-sample next-generation sequencing (NGS) data. However, cancer genomes are plagued with widespread multi-level structural aberrations, such as large-scale copy number variations (LCVs) and focal alterations (FAs), which differ in length scale, biological origin, and function. Additionally, cancer cells undergo an adaptive evolutionary process, even under controlled culture conditions, that generates clonal variability, including changes in copy number. Moreover, genomic NGS data are prone to inherent biases such as GC content, low-mappability regions, experimental biases, and coverage-influenced overdispersion. All these attributes pose challenges for the authentication of cancer cell lines as well as for accurate modeling and interpretation of their biology. Consequently, for cancer cell lines, these computational tools fail to identify the correct CNA profile and to distinguish between large-scale and focal alterations because they model cancer genomes inaccurately. Additionally, at low coverage (~1x-2x), the RD signal is affected by overdispersion-driven biases that significantly inflate the false detection of CNA regions. To address these problems, we developed the AStra and CNAtra tools for digital karyotyping and CNA discovery in cancer genomes.
First, AStra was developed for whole-genome sequencing (WGS)-derived digital karyotyping and authentication of cancer cells. AStra is Python-based software for de novo estimation of the genome-wide aneuploidy profile from raw WGS reads without any prior information about the exact chromosome number or aneuploidy levels. AStra identifies the best-fit aneuploidy profile, in which most genomic segments lie close to positive-integer copy number states. It provides an analytical and visual platform for cell authentication through rapid and easy comparison between different cell lines/strains. We demonstrated that the aneuploidy profile offers a unique signature that can distinguish the clonal variations of a cell line. We evaluated our approach using simulated and cancer datasets and showed that cancer cell lines exhibit distinct aneuploidy profiles that agree well with previously reported experimental observations. Additionally, AStra provides CN-associated features such as the whole-genome ploidy level and number that can be used for tuning single-sample CNA detection tools.
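The idea of choosing the profile that places most of the genome on positive-integer copy number states can be illustrated with a small sketch. This is not AStra's implementation: the function name `fit_copy_number_unit`, its arguments, and the length-weighted absolute-error score are assumptions made for illustration only. Each candidate value for the read depth of a single copy is scored by how far the resulting segment copy numbers fall from the nearest positive integer.

```python
import numpy as np

def fit_copy_number_unit(seg_depth, seg_length, candidate_units):
    """Pick the single-copy read depth that puts most of the genome on integer CN states.

    Hypothetical helper for illustration; not part of AStra.
    """
    seg_depth = np.asarray(seg_depth, dtype=float)
    weights = np.asarray(seg_length, dtype=float)
    weights = weights / weights.sum()  # weight each segment by its share of the genome

    # Try larger units first so that ties resolve to the lowest-ploidy explanation.
    units = np.sort(np.asarray(candidate_units, dtype=float))[::-1]

    errors = []
    for unit in units:
        cn = seg_depth / unit
        nearest = np.maximum(np.rint(cn), 1.0)  # copy number states are positive integers
        errors.append(np.sum(weights * np.abs(cn - nearest)))

    best_unit = units[int(np.argmin(errors))]
    copy_numbers = np.maximum(np.rint(seg_depth / best_unit), 1.0).astype(int)
    ploidy = float(np.sum(weights * copy_numbers))  # length-weighted mean copy number
    return best_unit, copy_numbers, ploidy


# Three segments with median read depths 20, 30 and 40; the last is twice as long.
unit, copy_numbers, ploidy = fit_copy_number_unit([20, 30, 40], [1, 1, 2], [5, 10, 20])
print(unit, copy_numbers, ploidy)
```

Scanning candidates from the largest unit down is a design choice of this sketch: a unit and its half fit integer states equally well, so the tie is broken in favor of the simpler, lower-ploidy profile.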
Next, we developed CNAtra to hierarchically discover and classify ‘large-scale’ and ‘focal’ copy number gains and losses from a single WGS sample. CNAtra first uses a multimodal distribution to estimate the copy number reference from the complex RD profile of the cancer genome. A Savitzky-Golay smoothing filter and modified Varri segmentation are then applied to capture the change points of the RD signal. We then developed a CN state-driven merging algorithm to identify the large segments with distinct copy numbers. Next, focal alterations are identified within each large segment using coverage-based thresholding to mitigate the adverse effects of signal variations. Using cancer cell line datasets and clinical samples from patients, we confirmed CNAtra’s ability to detect and distinguish segmental aneuploidies and focal alterations. We benchmarked CNAtra against other single-sample detection tools on realistic simulated data, in which we artificially introduced CNAs into the original cancer profiles. We found that CNAtra is superior in terms of precision, recall, and F-measure. CNAtra shows the highest sensitivity, 93% and 97% for detecting large-scale and focal alterations, respectively. Visual inspection of CNAs revealed that CNAtra is the most robust detection tool for low-coverage cancer data.
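The smoothing and change-point step can be sketched as follows. The smoothing uses SciPy's `savgol_filter`; the change-point detector is a plain two-window mean-difference score that stands in for the modified Varri segmentation, whose details are not given in this record. The function name `find_change_points` and every parameter value are assumptions for illustration, not CNAtra's settings.

```python
import numpy as np
from scipy.signal import savgol_filter

def find_change_points(read_depth, smooth_window=11, polyorder=2,
                       half_window=10, min_jump=5.0):
    """Smooth a binned RD signal and flag bins where the local mean shifts.

    Hypothetical stand-in for the segmentation step; not CNAtra's algorithm.
    """
    rd = np.asarray(read_depth, dtype=float)
    smoothed = savgol_filter(rd, window_length=smooth_window, polyorder=polyorder)

    n = len(smoothed)
    jump = np.zeros(n)
    # Compare the mean of the window to the left of bin i with the window starting at i.
    for i in range(half_window, n - half_window):
        left = smoothed[i - half_window:i].mean()
        right = smoothed[i:i + half_window].mean()
        jump[i] = abs(right - left)

    # A change point is a local maximum of the jump score above the threshold.
    change_points = [
        i for i in range(1, n - 1)
        if jump[i] >= min_jump and jump[i] > jump[i - 1] and jump[i] >= jump[i + 1]
    ]
    return smoothed, change_points


# Noise-free toy signal: 100 bins at depth 20 followed by 100 bins at depth 30.
signal = np.concatenate([np.full(100, 20.0), np.full(100, 30.0)])
smoothed, change_points = find_change_points(signal)
print(change_points)
```

On real low-coverage data the threshold would need to scale with the noise level, which is the role the abstract assigns to coverage-based thresholding.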
The presence of CNAs in the genome can affect the interpretation of many genetic analyses. One example is the analysis of genome-wide chromatin interactions. Apart from imaging techniques, high-throughput chromosome conformation capture (3C)-based techniques, such as Hi-C, have been used extensively to capture the spatial organization of chromatin in the form of contact maps built from NGS reads. Normalization of Hi-C contact maps is essential for accurate modeling and interpretation of high-throughput chromatin conformation experiments. Most Hi-C correction methods were originally developed for normal cell lines and mainly target systematic biases, either implicitly or explicitly. However, most Hi-C data were generated using cancer cell lines that carry multi-level CNAs, which cause over- or under-representation of interaction frequencies compared to CN-neutral regions. Therefore, CNA-driven biases need to be corrected to generate euploid-equivalent chromatin contact maps in cell lines with abnormal karyotypes.
We developed the HiCNAtra framework, which extracts the RD signal from Hi-C or 3C-seq reads to generate a high-resolution CNA profile and uses this information to correct the systematic biases in the chromatin contact map. We introduce a novel “entire fragment” counting approach for better estimation of the RD signal and CNA profile. We demonstrated that the RD signal calculated from Hi-C reads recapitulates the WGS-derived coverage signal of the same cell line. Using this CNA information together with other systematic biases, HiCNAtra simultaneously estimates the contribution of each bias and explicitly corrects the interaction matrix using Poisson regression. HiCNAtra normalization removes CNA-induced artifacts from the contact map, leading to a ‘homogeneous’ heatmap. Benchmarking against the OneD and CNV-Adjusted Iterative Correction (CAIC) methods, which specifically target CNA bias, as well as the commonly used iterative correction and eigenvector decomposition (ICE) method, showed that HiCNAtra correction results in the smallest 1D signal variations without deforming the inherent chromatin interaction landscape.
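Explicit correction with Poisson regression can be sketched as a log-linear model in which each bin pair's expected count depends on the product of the two bins' copy numbers and on one other per-bin bias. The covariates, the helper names `pair_design` and `fit_poisson`, and the Newton solver below are assumptions for illustration; the record does not specify HiCNAtra's actual model terms.

```python
import numpy as np
from itertools import combinations

def pair_design(cn, other_bias):
    """Design matrix over bin pairs i < j: intercept, log CN product, log bias product.

    Hypothetical covariates; the CN product is scaled so a diploid pair (2 x 2) maps to 0.
    """
    cn = np.asarray(cn, dtype=float)
    other = np.asarray(other_bias, dtype=float)
    i, j = np.array(list(combinations(range(len(cn)), 2))).T
    return np.column_stack([
        np.ones(len(i)),
        np.log(cn[i] * cn[j] / 4.0),
        np.log(other[i] * other[j]),
    ])

def fit_poisson(X, y, n_iter=100, tol=1e-10):
    """Fit log E[y] = X @ beta by Newton's method on the Poisson log-likelihood."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())  # start from the overall mean count
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        grad = X.T @ (y - mu)
        hess = X.T @ (X * mu[:, None])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Toy data: six bins with made-up copy numbers and a made-up second bias (e.g. mappability).
X = pair_design([2, 2, 4, 4, 3, 3], [1.0, 0.8, 0.9, 1.0, 0.7, 0.95])
counts = np.exp(X @ np.array([np.log(50.0), 1.0, 0.5]))  # noise-free expected counts
beta = fit_poisson(X, counts)
# Divide out the fitted bias terms, leaving only the intercept-level signal.
corrected = counts / np.exp(X[:, 1:] @ beta[1:])
print(beta, corrected[:3])
```

Because all bias terms are fitted jointly, the contribution of each bias is estimated in the presence of the others, which is the property the abstract highlights.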
To sum up, our computational tools provide an analytical and visualization platform for digital karyotyping of hyperploid cancer cell lines. AStra provides a genome-wide snapshot of the large-scale chromosomal alterations of the cancer genome, such as whole-genome ploidy. CNAtra and HiCNAtra provide more detailed karyotyping that includes the LCV and FA information from WGS and chromatin interaction data, respectively. They also provide platforms for visualizing the CNA profiles and chromatin contact maps. |
author2 |
Anupam Chattopadhyay |
author_facet |
Anupam Chattopadhyay Ahmed Ibrahim Samir Khalil |
format |
Thesis-Doctor of Philosophy |
author |
Ahmed Ibrahim Samir Khalil |
title |
Digital karyotyping of cancer cell lines from NGS data |
publisher |
Nanyang Technological University |
publishDate |
2020 |
url |
https://hdl.handle.net/10356/138176 |
spelling |
sg-ntu-dr.10356-138176 2020-10-28T08:40:45Z Digital karyotyping of cancer cell lines from NGS data Ahmed Ibrahim Samir Khalil Anupam Chattopadhyay School of Computer Science and Engineering anupam@ntu.edu.sg Doctor of Philosophy 2020-04-28T01:48:56Z 2020-04-28T01:48:56Z 2020 Thesis-Doctor of Philosophy Ahmed Ibrahim Samir Khalil. (2020). Digital karyotyping of cancer cell lines from NGS data. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/138176 10.32657/10356/138176 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University |