Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning
We present a study on reinforcement learning (RL) from human bandit feedback for sequence-to-sequence learning, exemplified by the task of bandit neural machine translation (NMT). We investigate the reliability of human bandit feedback, and analyze the influence of reliability on the learnability of...
Main Authors: | Kreutzer, Julia; Uyheng, Joshua; Riezler, Stefan |
---|---|
Format: | text |
Published: | Archīum Ateneo, 2018 |
Subjects: | Cognitive Psychology; Psychology |
Online Access: | https://archium.ateneo.edu/psychology-faculty-pubs/357 https://aclanthology.org/P18-1165/ |
Institution: | Ateneo De Manila University |
id |
ph-ateneo-arc.psychology-faculty-pubs-1361 |
---|---|
record_format |
eprints |
spelling |
ph-ateneo-arc.psychology-faculty-pubs-1361 2022-04-04T11:36:52Z Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning Kreutzer, Julia Uyheng, Joshua Riezler, Stefan We present a study on reinforcement learning (RL) from human bandit feedback for sequence-to-sequence learning, exemplified by the task of bandit neural machine translation (NMT). We investigate the reliability of human bandit feedback, and analyze the influence of reliability on the learnability of a reward estimator, and the effect of the quality of reward estimates on the overall RL task. Our analysis of cardinal (5-point ratings) and ordinal (pairwise preferences) feedback shows that their intra- and inter-annotator α-agreement is comparable. Best reliability is obtained for standardized cardinal feedback, and cardinal feedback is also easiest to learn and generalize from. Finally, improvements of over 1 BLEU can be obtained by integrating a regression-based reward estimator trained on cardinal feedback for 800 translations into RL for NMT. This shows that RL is possible even from small amounts of fairly reliable human feedback, pointing to a great potential for applications at larger scale. 2018-07-01T07:00:00Z text https://archium.ateneo.edu/psychology-faculty-pubs/357 https://aclanthology.org/P18-1165/ Psychology Department Faculty Publications Archīum Ateneo Cognitive Psychology Psychology |
institution |
Ateneo De Manila University |
building |
Ateneo De Manila University Library |
continent |
Asia |
country |
Philippines |
content_provider |
Ateneo De Manila University Library |
collection |
archium.Ateneo Institutional Repository |
topic |
Cognitive Psychology; Psychology |
spellingShingle |
Cognitive Psychology Psychology Kreutzer, Julia Uyheng, Joshua Riezler, Stefan Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning |
description |
We present a study on reinforcement learning (RL) from human bandit feedback for sequence-to-sequence learning, exemplified by the task of bandit neural machine translation (NMT). We investigate the reliability of human bandit feedback, and analyze the influence of reliability on the learnability of a reward estimator, and the effect of the quality of reward estimates on the overall RL task. Our analysis of cardinal (5-point ratings) and ordinal (pairwise preferences) feedback shows that their intra- and inter-annotator α-agreement is comparable. Best reliability is obtained for standardized cardinal feedback, and cardinal feedback is also easiest to learn and generalize from. Finally, improvements of over 1 BLEU can be obtained by integrating a regression-based reward estimator trained on cardinal feedback for 800 translations into RL for NMT. This shows that RL is possible even from small amounts of fairly reliable human feedback, pointing to a great potential for applications at larger scale. |
format |
text |
author |
Kreutzer, Julia; Uyheng, Joshua; Riezler, Stefan |
author_sort |
Kreutzer, Julia |
title |
Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning |
publisher |
Archīum Ateneo |
publishDate |
2018 |
url |
https://archium.ateneo.edu/psychology-faculty-pubs/357 https://aclanthology.org/P18-1165/ |