Mining, annotating and visualizing evolutionary networks of influenza virus

Influenza viruses are hosted by both avian and mammalian species. They evolve rapidly through genetic shift and drift to escape antibody binding, which can cause seasonal epidemics and devastating pandemics. The classification of influenza genes into lineages is an important part of the analysis...


Bibliographic Details
Main Author: Deshpande Akhila Sameer
Other Authors: Kwoh Chee Keong
Format: Theses and Dissertations
Language: English
Published: 2018
Subjects: DRNTU::Engineering::Computer science and engineering
Online Access: http://hdl.handle.net/10356/74217
Institution: Nanyang Technological University
id sg-ntu-dr.10356-74217
record_format dspace
spelling sg-ntu-dr.10356-74217 2019-12-10T13:06:54Z Mining, annotating and visualizing evolutionary networks of influenza virus Deshpande Akhila Sameer Kwoh Chee Keong Wee Kim Wee School of Communication and Information Bioinformatics Research Centre DRNTU::Engineering::Computer science and engineering Master of Science (Information Studies) 2018-05-09T08:03:43Z 2018-05-09T08:03:43Z 2018 Thesis http://hdl.handle.net/10356/74217 en Nanyang Technological University 77 p. application/pdf
institution Nanyang Technological University
building NTU Library
country Singapore
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering
description Influenza viruses are hosted by both avian and mammalian species. They evolve rapidly through genetic shift and drift to escape antibody binding, which can cause seasonal epidemics and devastating pandemics. The classification of influenza genes into lineages is an important part of the analysis of viral sequence data. The WHO/FAO/OIE H5N1 Evolution Working Group has specified criteria for defining clades from a phylogenetic tree of HA sequences that have evolved from the A/goose/Guangdong/1996 H5N1 virus. Independent studies have classified subtypes such as H1N1 and H9N2 into clades to establish a common nomenclature. Gene sequences can be classified by similarity to pre-defined lineages when the lineages are known, but there is a lack of tools that automatically produce clade information from input gene sequences, and manual inspection is tedious. This research presents a novel approach, MAVEN (Mining and Annotating Evolutionary Network), to determine clade information for influenza viruses of a particular subtype that have emerged as a consequence of selective genetic bottlenecks during transmission. MAVEN uses a combination of phylogenetic trees and unsupervised machine learning algorithms to find the clades. In this approach, phylogenetic trees are constructed from a fixed number of HA sequences sampled at random from the input sequences, and for each tree the sequences of its internal nodes are inferred using the Fitch algorithm. Each node in the tree is tested for non-significance using Student's t-test on the within and between distances of the leaf sequences under its two child nodes. A clustering algorithm run on the selected nodes groups them into a set of clusters. A tree is constructed for each cluster and a representative node (bottleneck sequence) is found. All sequences are then assigned to these representatives based on their distance from the bottleneck sequence, forming the clades. This solution not only clusters clades based on lineages but also expresses the lineage relationship between clusters in the form of a tree. We discuss a case study in which MAVEN is applied to 7052 H1N1 HA sequences that were examined in a previous publication and have already been classified into clades. We then compare both cluster classifications using cluster validation indexes such as Entropy, Silhouette Coefficient, Dunn index and PearsonGamma, and note that MAVEN performs better on all indexes. While influenza HA sequences were used for this study, the approach could be applied to any genes for lineage assignment.
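The abstract states that internal-node sequences are inferred "using Fitch" but gives no further detail. The following is a minimal sketch of Fitch's small-parsimony procedure on a rooted binary tree, not the thesis implementation. The nested-tuple tree encoding, the function names `bottom_up` and `top_down`, and the alphabetical tie-break are all assumptions made for illustration.

```python
# Illustrative sketch only: tree encoding and names are assumptions.
# A leaf is an aligned sequence string; an internal node is a (left, right) tuple.

def bottom_up(tree):
    """Fitch pass 1: per-site candidate state sets for every node."""
    if isinstance(tree, str):
        return ([{base} for base in tree], None)
    kids = [bottom_up(child) for child in tree]
    # Intersection of the children's sets if non-empty, otherwise their union.
    sets = [a & b or a | b for a, b in zip(kids[0][0], kids[1][0])]
    return (sets, kids)


def top_down(node, parent_seq=None, out=None):
    """Fitch pass 2: choose one sequence per internal node, in preorder."""
    sets, kids = node
    out = [] if out is None else out
    if kids is None:  # leaves already have an observed sequence
        return out
    parent = parent_seq or [None] * len(sets)
    # Keep the parent's state when it is a candidate; otherwise pick the
    # alphabetically smallest candidate (arbitrary tie-break, an assumption).
    seq = "".join(p if p in s else min(s) for s, p in zip(sets, parent))
    out.append(seq)
    for kid in kids:
        top_down(kid, seq, out)
    return out


tree = (("ACGT", "ACGA"), ("TCGA", "TCGT"))
internal = top_down(bottom_up(tree))  # root first, then left and right child
```

In this toy tree the root and both child nodes each receive one inferred sequence. A different tie-break can yield a different but equally parsimonious reconstruction.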
author2 Kwoh Chee Keong
author_facet Kwoh Chee Keong
Deshpande Akhila Sameer
format Theses and Dissertations
author Deshpande Akhila Sameer
author_sort Deshpande Akhila Sameer
title Mining, annotating and visualizing evolutionary networks of influenza virus
publishDate 2018
url http://hdl.handle.net/10356/74217
_version_ 1681045621951168512
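The node test and the clade assignment described in the abstract can be sketched as follows. This is a hedged illustration, not the thesis code. It assumes Hamming distance on aligned sequences, a pooled-variance Student t statistic comparing within-child and between-child pairwise distances, and assignment of each sequence to the nearest bottleneck sequence. The function names and the `bottlenecks` dictionary are hypothetical, and the record does not state which distance measure or significance threshold the thesis uses.

```python
from itertools import combinations, product
from math import sqrt
from statistics import mean, variance


def hamming(a: str, b: str) -> int:
    """Number of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def student_t(left: list[str], right: list[str]) -> float:
    """Pooled-variance t statistic: within-child vs between-child distances.

    Needs at least two within and two between distances, and raises
    ZeroDivisionError when all distances are identical.
    """
    within = [hamming(a, b) for group in (left, right)
              for a, b in combinations(group, 2)]
    between = [hamming(a, b) for a, b in product(left, right)]
    n_w, n_b = len(within), len(between)
    pooled = ((n_w - 1) * variance(within)
              + (n_b - 1) * variance(between)) / (n_w + n_b - 2)
    return (mean(within) - mean(between)) / sqrt(pooled * (1 / n_w + 1 / n_b))


def assign_to_clades(sequences: list[str], bottlenecks: dict[str, str]) -> dict[str, str]:
    """Map each sequence to the name of its nearest bottleneck sequence."""
    return {
        seq: min(bottlenecks, key=lambda name: hamming(seq, bottlenecks[name]))
        for seq in sequences
    }


# Toy data: leaf sequences under the two children of one internal node.
t_stat = student_t(["AAAA", "AAAT"], ["TTTT", "TTTA"])
clades = assign_to_clades(["AAAT", "TTTA", "ATTT"],
                          {"clade1": "AAAA", "clade2": "TTTT"})
```

A strongly negative statistic means the leaves under the two children are much closer within each child than across them. Turning the statistic into a p-value requires a t-distribution with `n_w + n_b - 2` degrees of freedom, which a statistics library can supply.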