Improving sample efficiency using attention in deep reinforcement learning

Bibliographic Details
Main Author: Ong, Dorvin Poh Jie
Other Authors: Lee Bu Sung, Francis
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access: https://hdl.handle.net/10356/150563
id sg-ntu-dr.10356-150563
record_format dspace
spelling sg-ntu-dr.10356-1505632021-06-14T01:09:19Z
Improving sample efficiency using attention in deep reinforcement learning
Ong, Dorvin Poh Jie
Lee Bu Sung, Francis
School of Computer Science and Engineering
EBSLEE@ntu.edu.sg
Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Bachelor of Engineering (Computer Engineering)
2021-06-14T01:09:18Z 2021-06-14T01:09:18Z 2021
Final Year Project (FYP)
Ong, D. P. J. (2021). Improving sample efficiency using attention in deep reinforcement learning. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/150563
https://hdl.handle.net/10356/150563
en
SCSE20-0527
application/pdf
Nanyang Technological University
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
spellingShingle Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Ong, Dorvin Poh Jie
Improving sample efficiency using attention in deep reinforcement learning
description Reinforcement learning is becoming increasingly popular due to its notable feats in mainstream games such as DOTA2 and Go, as well as its applicability to many fields. It has displayed the potential to exceed human-level performance in complicated environments and sequential decision-making problems. However, one limitation that has plagued reinforcement learning is its poor sample efficiency: of the three paradigms of machine learning, reinforcement learning requires the most samples to produce a useful result, and the more samples needed, the more energy and time it takes to train a useful model, which is expensive. In this report, we conducted a rigorous study of the reinforcement learning field, implemented Proximal Policy Optimization (PPO), and attempted to improve the sample efficiency of reinforcement learning algorithms using self-attention models. Borrowing ideas from previous implementations of self-attention models, we experiment with variants of the Self-Attending Network (SAN), namely the Channel-wise Self-Attending Network (C-SAN) and the Cross-Attending Network (CAN), the latter combining channel-column-wise and channel-row-wise attention. Our results show that CAN was distinctly more sample-efficient than the original SAN and the vanilla PPO (No Attention) model in the game of Pong. However, moving the implementation to Stable Baselines3 returned results that differ from these earlier findings; we attribute the discrepancy to implementation differences in the PPO algorithm. In the next experiment, we tested SAN, C-SAN and CAN on 49 Atari 2600 games. C-SAN performed better than the No Attention model by 15.36% on average, while CAN and SAN performed worse by 14.44% and 1.47% respectively. Based on these results, we hypothesize that self-attention models could perform better in complex environments, where the benefit of a better state representation could facilitate learning a better policy. Further re-evaluation on more complex environments over a longer training duration showed potential in CAN, which managed to outperform the other models. However, a preliminary investigation into why self-attention works was inconclusive; nevertheless, we offer some hypotheses to explain the effect of self-attention models. (An illustrative sketch of an attention-augmented PPO feature extractor follows this record.)
author2 Lee Bu Sung, Francis
author_facet Lee Bu Sung, Francis
Ong, Dorvin Poh Jie
format Final Year Project
author Ong, Dorvin Poh Jie
author_sort Ong, Dorvin Poh Jie
title Improving sample efficiency using attention in deep reinforcement learning
title_short Improving sample efficiency using attention in deep reinforcement learning
title_full Improving sample efficiency using attention in deep reinforcement learning
title_fullStr Improving sample efficiency using attention in deep reinforcement learning
title_full_unstemmed Improving sample efficiency using attention in deep reinforcement learning
title_sort improving sample efficiency using attention in deep reinforcement learning
publisher Nanyang Technological University
publishDate 2021
url https://hdl.handle.net/10356/150563
_version_ 1703971200868286464
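The abstract above describes self-attention variants (SAN, C-SAN, CAN) used as feature extractors for a PPO agent, later reimplemented on Stable Baselines3. The sketch below is a minimal illustration only, assuming a PyTorch and Stable Baselines3 setup; it is one plausible reading of a channel-wise self-attention block inserted into a standard convolutional backbone and registered with PPO as a custom features extractor, not the report's actual SAN, C-SAN or CAN code. The class names ChannelSelfAttention and AttentionCNN, the layer sizes, and the training settings are assumptions made for the example.

# A minimal sketch, assuming PyTorch and Stable Baselines3. Names and sizes
# below are illustrative, not taken from the report's implementation.
import torch
import torch.nn as nn
from stable_baselines3 import PPO
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class ChannelSelfAttention(nn.Module):
    # Self-attention over the channel dimension of a conv feature map:
    # each channel attends to every other channel via its flattened
    # spatial activations, and the result is added back residually.
    def __init__(self, spatial_dim):
        super().__init__()
        self.query = nn.Linear(spatial_dim, spatial_dim)
        self.key = nn.Linear(spatial_dim, spatial_dim)
        self.value = nn.Linear(spatial_dim, spatial_dim)
        self.scale = spatial_dim ** -0.5

    def forward(self, x):
        b, c, h, w = x.shape
        flat = x.view(b, c, h * w)                                        # (B, C, HW)
        q, k, v = self.query(flat), self.key(flat), self.value(flat)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, C, C)
        out = attn @ v                                                    # (B, C, HW)
        return out.view(b, c, h, w) + x                                   # residual


class AttentionCNN(BaseFeaturesExtractor):
    # Nature-CNN backbone with the attention block inserted after the last
    # convolution, exposed to Stable Baselines3 as a custom features extractor.
    def __init__(self, observation_space, features_dim=512):
        super().__init__(observation_space, features_dim)
        n_input = observation_space.shape[0]  # stacked frames arrive channel-first
        self.conv = nn.Sequential(
            nn.Conv2d(n_input, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        with torch.no_grad():  # infer the conv output shape from a sample observation
            sample = torch.as_tensor(observation_space.sample()[None]).float()
            _, c, h, w = self.conv(sample).shape
        self.attn = ChannelSelfAttention(spatial_dim=h * w)
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(c * h * w, features_dim), nn.ReLU()
        )

    def forward(self, observations):
        return self.head(self.attn(self.conv(observations)))


if __name__ == "__main__":
    # Example wiring into PPO on Pong (requires the Atari extras and ROMs).
    from stable_baselines3.common.env_util import make_atari_env
    from stable_baselines3.common.vec_env import VecFrameStack

    env = VecFrameStack(make_atari_env("PongNoFrameskip-v4", n_envs=8), n_stack=4)
    model = PPO(
        "CnnPolicy",
        env,
        policy_kwargs=dict(
            features_extractor_class=AttentionCNN,
            features_extractor_kwargs=dict(features_dim=512),
        ),
        verbose=1,
    )
    model.learn(total_timesteps=1_000_000)

A CAN-style variant would, per the abstract, combine channel-row-wise and channel-column-wise attention over the feature map instead of attending across whole channels; under the structure above, that change would be confined to the attention module.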