Comparison of next-generation sequencing samples using compression-based distances and its application to phylogenetic reconstruction

Background: Enormous volumes of short read data from next-generation sequencing (NGS) technologies have posed new challenges to the area of genomic sequence comparison. The multiple sequence alignment approach is hardly applicable to NGS data due to the challenging problem of short read assembly. Th...

Bibliographic Details
Main Authors: Tran, Ngoc Hieu, Chen, Xin
Other Authors: School of Physical and Mathematical Sciences
Format: Article
Language: English
Published: 2014
Subjects: DRNTU::Science
Online Access: https://hdl.handle.net/10356/97426
http://hdl.handle.net/10220/19479
Institution: Nanyang Technological University
id sg-ntu-dr.10356-97426
record_format dspace
spelling sg-ntu-dr.10356-97426 2023-02-28T19:22:54Z Comparison of next-generation sequencing samples using compression-based distances and its application to phylogenetic reconstruction Tran, Ngoc Hieu Chen, Xin School of Physical and Mathematical Sciences DRNTU::Science Background: Enormous volumes of short read data from next-generation sequencing (NGS) technologies have posed new challenges to the area of genomic sequence comparison. The multiple sequence alignment approach is hardly applicable to NGS data due to the challenging problem of short read assembly. Thus alignment-free methods are needed for the comparison of NGS samples of short reads. Results: Recently several k-mer based distance measures such as CVTree, d2S, and co-phylog have been proposed or enhanced to address this problem. However, how to choose an optimal k value for those distance measures is not trivial since it may depend on different aspects of the sequence data. In this paper, we considered an alternative parameter-free approach: compression-based distance measures. These measures have shown good performance for the comparison of long genomic sequences, but they have not yet been tested on NGS short reads. Hence, we performed extensive validation in this study and showed that the compression-based distances are highly consistent with those distances obtained from the k-mer based methods, from the multiple sequence alignment approach, and from existing benchmarks in the literature. Moreover, as the compression-based distance measures are parameter-free, no parameter optimization is required and these measures still perform consistently well on multiple types of sequence data, for different kinds of species and taxonomy levels. Conclusions: The compression-based distance measures are assembly-free, alignment-free, parameter-free, and thus represent useful tools for the comparison of long genomic sequences as well as the comparison of NGS samples of short reads.
NMRC (Natl Medical Research Council, S’pore) Published version 2014-05-30T06:48:50Z 2019-12-06T19:42:40Z 2014-05-30T06:48:50Z 2019-12-06T19:42:40Z 2014 2014 Journal Article Tran, N. H., & Chen, X. (2014). Comparison of next-generation sequencing samples using compression-based distances and its application to phylogenetic reconstruction. BMC Research Notes, 7:320. 1756-0500 https://hdl.handle.net/10356/97426 http://hdl.handle.net/10220/19479 10.1186/1756-0500-7-320 24886411 180418 en BMC research notes © 2014 Tran and Chen. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Science
spellingShingle DRNTU::Science
Tran, Ngoc Hieu
Chen, Xin
Comparison of next-generation sequencing samples using compression-based distances and its application to phylogenetic reconstruction
description Background: Enormous volumes of short read data from next-generation sequencing (NGS) technologies have posed new challenges to the area of genomic sequence comparison. The multiple sequence alignment approach is hardly applicable to NGS data due to the challenging problem of short read assembly. Thus alignment-free methods are needed for the comparison of NGS samples of short reads. Results: Recently several k-mer based distance measures such as CVTree, d2S, and co-phylog have been proposed or enhanced to address this problem. However, how to choose an optimal k value for those distance measures is not trivial since it may depend on different aspects of the sequence data. In this paper, we considered an alternative parameter-free approach: compression-based distance measures. These measures have shown good performance for the comparison of long genomic sequences, but they have not yet been tested on NGS short reads. Hence, we performed extensive validation in this study and showed that the compression-based distances are highly consistent with those distances obtained from the k-mer based methods, from the multiple sequence alignment approach, and from existing benchmarks in the literature. Moreover, as the compression-based distance measures are parameter-free, no parameter optimization is required and these measures still perform consistently well on multiple types of sequence data, for different kinds of species and taxonomy levels. Conclusions: The compression-based distance measures are assembly-free, alignment-free, parameter-free, and thus represent useful tools for the comparison of long genomic sequences as well as the comparison of NGS samples of short reads.
author2 School of Physical and Mathematical Sciences
author_facet School of Physical and Mathematical Sciences
Tran, Ngoc Hieu
Chen, Xin
format Article
author Tran, Ngoc Hieu
Chen, Xin
author_sort Tran, Ngoc Hieu
title Comparison of next-generation sequencing samples using compression-based distances and its application to phylogenetic reconstruction
publishDate 2014
url https://hdl.handle.net/10356/97426
http://hdl.handle.net/10220/19479
_version_ 1759855798205284352