A cautionary note on the use of unsupervised machine learning algorithms to characterise malaria parasite population structure from genetic distance matrices

© 2020 Watson et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Genetic surveillance of malaria parasites s...

Bibliographic Details
Main Authors: James A. Watson, Aimee R. Taylor, Elizabeth A. Ashley, Arjen Dondorp, Caroline O. Buckee, Nicholas J. White, Chris C. Holmes
Other Authors: Harvard T.H. Chan School of Public Health
Format: Article
Published: 2020
Subjects: Agricultural and Biological Sciences; Biochemistry, Genetics and Molecular Biology; Medicine
Online Access: https://repository.li.mahidol.ac.th/handle/123456789/59807
Institution: Mahidol University
id th-mahidol.59807
record_format dspace
spelling th-mahidol.59807 2020-11-18T16:58:17Z
Harvard T.H. Chan School of Public Health; University of Oxford; Mahosot Hospital, Lao; Mahidol University; Nuffield Department of Medicine; Broad Institute
2020-11-18T07:40:50Z 2020-11-18T07:40:50Z 2020-10-09
Article
PLoS Genetics. Vol.16, No.10 (2020)
10.1371/journal.pgen.1009037
15537404 15537390
2-s2.0-85092928737
https://repository.li.mahidol.ac.th/handle/123456789/59807
Mahidol University
SCOPUS
https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85092928737&origin=inward
institution Mahidol University
building Mahidol University Library
continent Asia
country Thailand
content_provider Mahidol University Library
collection Mahidol University Institutional Repository
topic Agricultural and Biological Sciences
Biochemistry, Genetics and Molecular Biology
Medicine
description © 2020 Watson et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Genetic surveillance of malaria parasites supports malaria control programmes, treatment guidelines and elimination strategies. Surveillance studies often pose questions about malaria parasite ancestry (e.g. how antimalarial resistance has spread) and employ statistical methods that characterise parasite population structure. Many of the methods used to characterise structure are unsupervised machine learning algorithms which depend on a genetic distance matrix, notably principal coordinates analysis (PCoA) and hierarchical agglomerative clustering (HAC). PCoA and HAC are sensitive to both the definition of genetic distance and algorithmic specification. Importantly, neither algorithm infers malaria parasite ancestry. As such, PCoA and HAC can inform (e.g. via exploratory data visualisation and hypothesis generation), but not answer comprehensively, key questions about malaria parasite ancestry. We illustrate the sensitivity of PCoA and HAC using 393 Plasmodium falciparum whole genome sequences collected from Cambodia and neighbouring regions (where antimalarial resistance has emerged and spread recently) and we provide tentative guidance for the use and interpretation of PCoA and HAC in malaria parasite genetic epidemiology. This guidance includes a call for fully transparent and reproducible analysis pipelines that feature (i) a clearly outlined scientific question; (ii) a clear justification of analytical methods used to answer the scientific question along with discussion of any inferential limitations; (iii) publicly available genetic distance matrices when downstream analyses depend on them; and (iv) sensitivity analyses. 
To bridge the inferential disconnect between the output of non-inferential unsupervised learning algorithms and the scientific questions of interest, tailor-made statistical models are needed to infer malaria parasite ancestry. In the absence of such models speculative reasoning should feature only as discussion but not as results.
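The sensitivity to algorithmic specification described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the six toy positions, the helper names `pcoa` and `hac_labels`, and the choice of three clusters are all hypothetical, and the distances are not parasite data. The sketch runs classical PCoA and then HAC with two linkage rules on the same distance matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def pcoa(dist, n_axes=2):
    """Classical PCoA: eigendecompose the double-centred squared distances."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ d2 @ centring
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_axes]
    # Non-Euclidean distances can yield negative eigenvalues; clip them to zero.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


def hac_labels(dist, method, n_clusters):
    """Cut a HAC tree built with the given linkage into at most n_clusters groups."""
    tree = linkage(squareform(dist, checks=False), method=method)
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Hypothetical toy distances between six samples (assumption: not parasite data).
positions = np.array([0.0, 1.0, 2.1, 3.3, 7.0, 7.5])
dist = np.abs(positions[:, None] - positions[None, :])

coords = pcoa(dist, n_axes=2)
single = hac_labels(dist, method="single", n_clusters=3)
complete = hac_labels(dist, method="complete", n_clusters=3)

print("PCoA coordinates:\n", coords.round(3))
print("single linkage labels:  ", single)
print("complete linkage labels:", complete)
```

In this toy case the third sample joins the first two under single linkage but pairs with the fourth under complete linkage, although the distance matrix is identical. Neither partition says anything about ancestry, which is the abstract's point about treating such output as exploratory.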
author2 Harvard T.H. Chan School of Public Health
format Article
author James A. Watson
Aimee R. Taylor
Elizabeth A. Ashley
Arjen Dondorp
Caroline O. Buckee
Nicholas J. White
Chris C. Holmes
author_sort James A. Watson
title A cautionary note on the use of unsupervised machine learning algorithms to characterise malaria parasite population structure from genetic distance matrices
publishDate 2020
url https://repository.li.mahidol.ac.th/handle/123456789/59807