Mining, annotating and visualizing evolutionary networks of influenza virus
Main Author: | |
---|---|
Other Authors: | |
Format: | Theses and Dissertations |
Language: | English |
Published: | 2018 |
Subjects: | |
Online Access: | http://hdl.handle.net/10356/74217 |
Institution: | Nanyang Technological University |
Summary: | Influenza viruses are hosted by both avian and mammalian species. They evolve rapidly
through genetic shift and drift to escape antibody binding, which can cause seasonal
epidemics and devastating pandemics. The classification of influenza genes into lineages
is an important part of the analysis of viral sequence data.
The WHO/FAO/OIE H5N1 Evolution Working Group has specified criteria for defining
clades from a phylogenetic tree of HA sequences that have evolved from the
A/goose/Guangdong/1996 H5N1 virus. Independent studies have classified subtypes such as
H1N1 and H9N2 into clades to establish a common nomenclature. Gene sequences
can be classified by similarity to pre-defined lineages when the lineages are known,
but there is a lack of tools that automatically produce clade information from input gene
sequences, and manual inspection is tedious.
This research presents a novel approach, MAVEN (Mining and Annotating Evolutionary
Network), to determine the clade information of an influenza virus of a particular subtype,
which has emerged as a consequence of selective genetic bottlenecks during transmission.
MAVEN uses a combination of phylogenetic trees and unsupervised machine
learning algorithms to find the clades. In this approach, phylogenetic trees are constructed
using a fixed number of HA sequences drawn at random from the input sequences; for
each tree, the sequences of its internal nodes are inferred using the Fitch algorithm. Each
node in the tree is tested for non-significance using Student's t-test on the within and
between distances for the leaf node sequences present in its two child nodes. A clustering
algorithm run on the selected nodes groups them into a set of clusters. A tree is constructed
for each cluster and a representative node (bottleneck sequence) is found. All the sequences
are then assigned to these representatives based on their distance from the bottleneck
sequence, forming the clades. This solution not only clusters clades based on lineages but
also expresses the lineage relationship between clusters in the form of a tree.
We discuss a case study in which MAVEN is applied to 7052 H1N1 HA sequences that
were examined in a previous publication and have already been classified into
clades. We then compare the two cluster classifications using cluster validation
indices such as entropy, the silhouette coefficient, the Dunn index and Pearson gamma,
and note that MAVEN performs better on all indices.
While influenza HA sequences were used for the purpose of this study, this approach
could be applied to any gene for lineage assignment. |
---|
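
Two steps named in the summary, inferring internal-node sequences with the Fitch algorithm and assigning sequences to the nearest bottleneck sequence, can be illustrated with a small sketch. This is not the thesis's implementation, which the record does not show. It assumes a rooted binary tree written as nested tuples of leaf names, aligned sequences of equal length, Hamming distance as the distance measure (the summary does not say which distance is used), and an arbitrary alphabetical tie-break. All function names are hypothetical.

```python
# Illustrative sketch only; names, tree encoding and distance are assumptions.

def fitch_sets(tree, column):
    """Bottom-up Fitch pass for one alignment column: candidate states per node."""
    sets = {}

    def visit(node):
        if isinstance(node, str):               # leaf: its observed state
            sets[node] = {column[node]}
        else:                                   # internal node with two children
            left, right = visit(node[0]), visit(node[1])
            common = left & right
            sets[node] = common if common else left | right
        return sets[node]

    visit(tree)
    return sets


def fitch_assign(node, sets, parent_state, out):
    """Top-down pass: keep the parent's state when allowed, else pick one."""
    state = parent_state if parent_state in sets[node] else min(sets[node])
    out[node] = state
    if not isinstance(node, str):
        fitch_assign(node[0], sets, state, out)
        fitch_assign(node[1], sets, state, out)


def infer_ancestors(tree, leaf_seqs):
    """Return {node: sequence} for every node, one Fitch run per site."""
    length = len(next(iter(leaf_seqs.values())))
    per_node = {}
    for i in range(length):
        column = {name: seq[i] for name, seq in leaf_seqs.items()}
        sets = fitch_sets(tree, column)
        states = {}
        fitch_assign(tree, sets, None, states)
        for node, state in states.items():
            per_node.setdefault(node, []).append(state)
    return {node: "".join(chars) for node, chars in per_node.items()}


def hamming(a, b):
    """Number of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def assign_to_clades(seqs, representatives):
    """Map each sequence name to the label of its closest representative."""
    return {
        name: min(representatives, key=lambda c: hamming(seq, representatives[c]))
        for name, seq in seqs.items()
    }


# Toy data, invented for illustration.
toy_tree = (("a", "b"), ("c", "d"))
toy_seqs = {"a": "AAG", "b": "AAT", "c": "CAT", "d": "CAT"}
ancestors = infer_ancestors(toy_tree, toy_seqs)
toy_reps = {"clade1": ancestors[("a", "b")], "clade2": ancestors[("c", "d")]}
clades = assign_to_clades(toy_seqs, toy_reps)
```

In this toy run the two internal nodes stand in for bottleneck sequences. The summary describes choosing representatives through a t-test filter and a clustering step, which this sketch does not attempt.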