Mining, annotating and visualizing evolutionary networks of influenza virus

Bibliographic Details
Main Author: Deshpande Akhila Sameer
Other Authors: Kwoh Chee Keong
Format: Theses and Dissertations
Language: English
Published: 2018
Subjects:
Online Access: http://hdl.handle.net/10356/74217
Institution: Nanyang Technological University
Description
Summary: Influenza viruses are hosted by both avian and mammalian species. They evolve rapidly through genetic shift and drift to escape antibody binding, which can cause seasonal epidemics and devastating pandemics. The classification of influenza genes into lineages is an important part of the analysis of viral sequence data. The WHO/FAO/OIE H5N1 evolution working group has specified criteria for defining a clade from a phylogenetic tree for HA sequences that have evolved from the A/goose/Guangdong/1996 H5N1 virus. Independent studies have classified subtypes such as H1N1 and H9N2 into clades to establish a common nomenclature. When lineages are already known, gene sequences can be classified by their similarity to the pre-defined lineages. However, there is a lack of tools that automatically produce clade information from input gene sequences, and manual inspection is tedious. This research presents a novel approach, MAVEN (Mining and Annotating Evolutionary Network), to determine the clades of an influenza virus subtype that have emerged as a consequence of selective genetic bottlenecks during transmission. MAVEN uses a combination of phylogenetic trees and unsupervised machine learning algorithms to find the clades. In this approach, phylogenetic trees are constructed from a fixed number of HA sequences drawn at random from the input sequences, and for each tree the sequences of its internal nodes are inferred using the Fitch algorithm. Each node in the tree is tested for non-significance using Student's t-test on the within-group and between-group distances of the leaf sequences under its two child nodes. A clustering algorithm run on the selected nodes groups them into a set of clusters. A tree is constructed for each cluster and a representative node (the bottleneck sequence) is found. All sequences are then assigned to these representatives based on their distance from the bottleneck sequence, forming the clades.
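Two of the steps named in the summary, Fitch inference of internal-node states and the assignment of sequences to the nearest bottleneck sequence, can be illustrated with a minimal Python sketch. It is not taken from the thesis: the nested-tuple tree encoding, the function names, the toy sequences and the use of Hamming distance are all assumptions made for this example, and the significance test and clustering steps are left out.

```python
def fitch_sets(tree, leaf_seqs):
    """Bottom-up Fitch pass over a binary tree.

    `tree` is a leaf name (str) or a pair (left, right); this encoding is an
    assumption of the sketch. Returns the per-site candidate state sets for
    the root of `tree` and the number of substitutions implied below it.
    """
    if isinstance(tree, str):  # leaf: each site has exactly one observed state
        return [{c} for c in leaf_seqs[tree]], 0
    left, right = tree
    lsets, lcost = fitch_sets(left, leaf_seqs)
    rsets, rcost = fitch_sets(right, leaf_seqs)
    sets, cost = [], lcost + rcost
    for a, b in zip(lsets, rsets):
        common = a & b
        if common:               # children agree on at least one state
            sets.append(common)
        else:                    # disagreement: keep both, count one change
            sets.append(a | b)
            cost += 1
    return sets, cost


def hamming(a, b):
    """Number of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def assign_to_clades(seqs, bottlenecks):
    """Map each sequence name to the label of its nearest bottleneck sequence."""
    return {
        name: min(bottlenecks, key=lambda label: hamming(s, bottlenecks[label]))
        for name, s in seqs.items()
    }


# Toy aligned sequences (hypothetical, far shorter than real HA genes).
leaf_seqs = {"a": "ACGT", "b": "ACGA", "c": "TCGA", "d": "TCGA"}
tree = (("a", "b"), ("c", "d"))
root_sets, changes = fitch_sets(tree, leaf_seqs)

# Hypothetical bottleneck sequences, one per clade.
bottlenecks = {"clade1": "ACGA", "clade2": "TCGA"}
clades = assign_to_clades(leaf_seqs, bottlenecks)
```

On this toy input the root keeps both A and T as candidates at the first site, and sequences a and b fall into clade1 while c and d fall into clade2. A real pipeline would also need a top-down pass to pick one state per site and a distance measure suited to the data.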
This solution not only clusters sequences into clades based on lineage but also expresses the lineage relationships among the clusters in the form of a tree. We discuss a case study in which MAVEN is applied to 7052 H1N1 HA sequences that were examined in a previous publication and have already been classified into clades. We then compare the two classifications using cluster validation indices, namely Entropy, Silhouette Coefficient, Dunn index and Pearson Gamma, and note that MAVEN performs better on all of them. While influenza HA sequences were used for this study, the approach could be applied to any gene for lineage assignment.
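To make two of the validation indices concrete, the following Python sketch computes a Dunn index and a size-weighted cluster entropy on toy data. The exact definitions used in the thesis are not given in this record, so the formulas below are common textbook variants chosen as assumptions, and the function names, the Hamming distance and the sample sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def hamming(a, b):
    """Number of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def dunn_index(clusters):
    """Smallest between-cluster distance divided by largest cluster diameter.

    `clusters` maps a label to a list of aligned sequences. Assumes at least
    two clusters and at least one cluster holding two distinct sequences.
    """
    diameter = max(
        hamming(a, b)
        for members in clusters.values()
        for a, b in combinations(members, 2)
    )
    separation = min(
        hamming(a, b)
        for (_, m1), (_, m2) in combinations(clusters.items(), 2)
        for a in m1
        for b in m2
    )
    return separation / diameter


def clustering_entropy(assigned, reference):
    """Size-weighted mean entropy of reference labels within each cluster.

    `assigned` and `reference` map sequence names to labels. A value of 0
    means every cluster contains a single reference label.
    """
    by_cluster = {}
    for name, label in assigned.items():
        by_cluster.setdefault(label, []).append(reference[name])
    total = 0.0
    for refs in by_cluster.values():
        counts = Counter(refs)
        h = -sum((c / len(refs)) * log2(c / len(refs)) for c in counts.values())
        total += len(refs) / len(assigned) * h
    return total


# Toy data (hypothetical).
clusters = {"c1": ["ACGT", "ACGA"], "c2": ["TCGA", "TTGA"]}
assigned = {"a": "c1", "b": "c1", "c": "c2", "d": "c2"}
reference = {"a": "x", "b": "x", "c": "y", "d": "x"}
dunn = dunn_index(clusters)
entropy = clustering_entropy(assigned, reference)
```

Under these definitions a higher Dunn index indicates compact, well-separated clusters, and a lower entropy indicates closer agreement with the reference labels.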