Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning

We present a study on reinforcement learning (RL) from human bandit feedback for sequence-to-sequence learning, exemplified by the task of bandit neural machine translation (NMT). We investigate the reliability of human bandit feedback, and analyze the influence of reliability on the learnability of...

Bibliographic Details
Main Authors:	Kreutzer, Julia; Uyheng, Joshua; Riezler, Stefan
Format:	text
Published:	Archīum Ateneo 2018
Subjects:	Cognitive Psychology; Psychology
Online Access:	https://archium.ateneo.edu/psychology-faculty-pubs/357 https://aclanthology.org/P18-1165/
Institution:	Ateneo De Manila University

id	ph-ateneo-arc.psychology-faculty-pubs-1361
record_format	eprints
spelling	ph-ateneo-arc.psychology-faculty-pubs-1361; 2022-04-04T11:36:52Z; 2018-07-01T07:00:00Z; text; Psychology Department Faculty Publications; Archīum Ateneo
institution	Ateneo De Manila University
building	Ateneo De Manila University Library
continent	Asia
country	Philippines
content_provider	Ateneo De Manila University Library
collection	archium.Ateneo Institutional Repository
topic	Cognitive Psychology; Psychology
description	We present a study on reinforcement learning (RL) from human bandit feedback for sequence-to-sequence learning, exemplified by the task of bandit neural machine translation (NMT). We investigate the reliability of human bandit feedback, and analyze the influence of reliability on the learnability of a reward estimator, and the effect of the quality of reward estimates on the overall RL task. Our analysis of cardinal (5-point ratings) and ordinal (pairwise preferences) feedback shows that their intra- and inter-annotator α-agreement is comparable. Best reliability is obtained for standardized cardinal feedback, and cardinal feedback is also easiest to learn and generalize from. Finally, improvements of over 1 BLEU can be obtained by integrating a regression-based reward estimator trained on cardinal feedback for 800 translations into RL for NMT. This shows that RL is possible even from small amounts of fairly reliable human feedback, pointing to a great potential for applications at larger scale.
format	text
author	Kreutzer, Julia; Uyheng, Joshua; Riezler, Stefan
author_facet	Kreutzer, Julia; Uyheng, Joshua; Riezler, Stefan
author_sort	Kreutzer, Julia
title	Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning
publisher	Archīum Ateneo
publishDate	2018
url	https://archium.ateneo.edu/psychology-faculty-pubs/357 https://aclanthology.org/P18-1165/
_version_	1729800149365948416
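
The description above reports that standardized cardinal feedback gave the best reliability, but this record does not say how the ratings were standardized. As a rough illustration only, the sketch below assumes per-annotator z-scoring: each rater's 5-point ratings are centered on that rater's mean and scaled by that rater's standard deviation. The function name, the tuple layout, and the sample data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_per_rater(ratings):
    """Z-score each rating against its own rater's mean and std deviation.

    ratings: list of (rater_id, item_id, score) tuples (hypothetical layout).
    Returns a list of (rater_id, item_id, z_score) in the same order.
    """
    # Collect every score given by each rater.
    scores_by_rater = defaultdict(list)
    for rater, _item, score in ratings:
        scores_by_rater[rater].append(score)

    # Per-rater mean and population standard deviation.
    stats = {r: (mean(s), pstdev(s)) for r, s in scores_by_rater.items()}

    standardized = []
    for rater, item, score in ratings:
        mu, sigma = stats[rater]
        # A rater who always gives the same score carries no signal: map to 0.
        z = 0.0 if sigma == 0 else (score - mu) / sigma
        standardized.append((rater, item, z))
    return standardized


# Made-up example: rater "a" uses the full scale, rater "b" always gives 4.
sample = [
    ("a", "t1", 1), ("a", "t2", 3), ("a", "t3", 5),
    ("b", "t1", 4), ("b", "t2", 4),
]
result = standardize_per_rater(sample)
```

Standardizing this way removes differences in how strictly or generously individual raters use the scale, which is one plausible reason agreement could improve; whether it matches the authors' exact procedure is an assumption here.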

Similar Items