Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning

We present a study on reinforcement learning (RL) from human bandit feedback for sequence-to-sequence learning, exemplified by the task of bandit neural machine translation (NMT). We investigate the reliability of human bandit feedback, and analyze the influence of reliability on the learnability of...

Bibliographic Details
Main Authors: Kreutzer, Julia; Uyheng, Joshua; Riezler, Stefan
Format: text
Published: Archīum Ateneo 2018
Subjects: Cognitive Psychology; Psychology
Online Access: https://archium.ateneo.edu/psychology-faculty-pubs/357
https://aclanthology.org/P18-1165/
Institution: Ateneo De Manila University
id ph-ateneo-arc.psychology-faculty-pubs-1361
record_format eprints
spelling ph-ateneo-arc.psychology-faculty-pubs-1361 2022-04-04T11:36:52Z Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning Kreutzer, Julia Uyheng, Joshua Riezler, Stefan 2018-07-01T07:00:00Z text https://archium.ateneo.edu/psychology-faculty-pubs/357 https://aclanthology.org/P18-1165/ Psychology Department Faculty Publications Archīum Ateneo Cognitive Psychology Psychology
institution Ateneo De Manila University
building Ateneo De Manila University Library
continent Asia
country Philippines
content_provider Ateneo De Manila University Library
collection archium.Ateneo Institutional Repository
topic Cognitive Psychology
Psychology
description We present a study on reinforcement learning (RL) from human bandit feedback for sequence-to-sequence learning, exemplified by the task of bandit neural machine translation (NMT). We investigate the reliability of human bandit feedback, and analyze the influence of reliability on the learnability of a reward estimator, and the effect of the quality of reward estimates on the overall RL task. Our analysis of cardinal (5-point ratings) and ordinal (pairwise preferences) feedback shows that their intra- and inter-annotator α-agreement is comparable. Best reliability is obtained for standardized cardinal feedback, and cardinal feedback is also easiest to learn and generalize from. Finally, improvements of over 1 BLEU can be obtained by integrating a regression-based reward estimator trained on cardinal feedback for 800 translations into RL for NMT. This shows that RL is possible even from small amounts of fairly reliable human feedback, pointing to a great potential for applications at larger scale.
format text
author Kreutzer, Julia
Uyheng, Joshua
Riezler, Stefan
title Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning
publisher Archīum Ateneo
publishDate 2018
url https://archium.ateneo.edu/psychology-faculty-pubs/357
https://aclanthology.org/P18-1165/