Interactive labelling of a multivariate dataset for supervised machine learning using linked visualisations, clustering, and active learning

Supervised machine learning techniques require labelled multivariate training datasets. Many approaches address the issue of unlabelled datasets by tightly coupling machine learning algorithms with interactive visualisations. Using appropriate techniques, analysts can play an active role in a highly...

Bibliographic Details
Main Authors: Chegini, Mohammad, Bernard, Jürgen, Berger, Philip, Sourin, Alexei, Andrews, Keith, Schreck, Tobias
Other Authors: School of Computer Science and Engineering
Format: Article
Language: English
Published: 2019
Subjects: Engineering::Computer science and engineering, Labelling, Clustering
Online Access: https://hdl.handle.net/10356/86031
http://hdl.handle.net/10220/49845
Institution: Nanyang Technological University
id sg-ntu-dr.10356-86031
record_format dspace
spelling sg-ntu-dr.10356-86031 2020-03-07T11:48:58Z
Published version
2019-09-03T04:29:04Z 2019-12-06T16:14:43Z
2019
Journal Article
Chegini, M., Bernard, J., Berger, P., Sourin, A., Andrews, K., & Schreck, T. (2019). Interactive labelling of a multivariate dataset for supervised machine learning using linked visualisations, clustering, and active learning. Visual Informatics, 3(1), 9-17. doi:10.1016/j.visinf.2019.03.002
en
Visual Informatics
© 2019 Zhejiang University and Zhejiang University Press. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
9 p.
application/pdf
building NTU Library
country Singapore
collection DR-NTU
topic Engineering::Computer science and engineering
Labelling
Clustering
description Supervised machine learning techniques require labelled multivariate training datasets. Many approaches address the issue of unlabelled datasets by tightly coupling machine learning algorithms with interactive visualisations. Using appropriate techniques, analysts can play an active role in a highly interactive and iterative machine learning process to label the dataset and create meaningful partitions. While this principle has been implemented either for unsupervised, semi-supervised, or supervised machine learning tasks, the combination of all three methodologies remains challenging. In this paper, a visual analytics approach is presented, combining a variety of machine learning capabilities with four linked visualisation views, all integrated within the mVis (multivariate Visualiser) system. The available palette of techniques allows an analyst to perform exploratory data analysis on a multivariate dataset and divide it into meaningful labelled partitions, from which a classifier can be built. In the workflow, the analyst can label interesting patterns or outliers in a semi-supervised process supported by active learning. Once a dataset has been interactively labelled, the analyst can continue the workflow with supervised machine learning to assess to what degree the subsequent classifier has effectively learned the concepts expressed in the labelled training dataset. Using a novel technique called automatic dimension selection, interactions the analyst had with dimensions of the multivariate dataset are used to steer the machine learning algorithms. A real-world football dataset is used to show the utility of mVis for a series of analysis and labelling tasks, from initial labelling through iterations of data exploration, clustering, classification, and active learning to refine the named partitions, to finally producing a high-quality labelled training dataset suitable for training a classifier. 
The tool empowers the analyst with interactive visualisations including scatterplots, parallel coordinates, similarity maps for records, and a new similarity map for partitions.