Improving sample efficiency using attention in deep reinforcement learning
Reinforcement learning is becoming increasingly popular due to its notable feats in mainstream games such as DOTA 2 and Go, as well as its applicability to many fields. It has displayed potential to exceed human-level performance in complicated environments and sequential decision-making problems.
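The full record description below characterizes the Cross-Attending Network (CAN) studied in this project as a combination of channel-column-wise and channel-row-wise attention applied to a PPO agent's convolutional features. As a rough, hypothetical illustration of that idea only (the axis definitions and module names here are assumptions, not the thesis's actual SAN/C-SAN/CAN code), a minimal PyTorch sketch of per-channel attention along each spatial axis might look like this:

```python
# Hypothetical sketch (not the thesis's actual code): per-channel self-attention
# along each spatial axis of a CNN feature map, combined as a residual block,
# in the spirit of the Cross-Attending Network (CAN) described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


def axis_attention(x: torch.Tensor, axis: str) -> torch.Tensor:
    """Self-attention within each channel of a (B, C, H, W) feature map.

    axis="row": rows attend to rows (the W positions act as features).
    axis="col": columns attend to columns (the H positions act as features).
    """
    if axis == "col":
        x = x.transpose(-1, -2)                        # swap H and W, reuse row path
    scores = torch.matmul(x, x.transpose(-1, -2)) / (x.shape[-1] ** 0.5)
    out = torch.matmul(F.softmax(scores, dim=-1), x)   # weighted mix along the axis
    if axis == "col":
        out = out.transpose(-1, -2)
    return out


class CrossAttendingBlock(nn.Module):
    """Adds row-wise and column-wise attention back onto the input features."""

    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 convolution mixes channels before attention; purely illustrative.
        self.project = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.project(x)
        return x + axis_attention(h, "row") + axis_attention(h, "col")


if __name__ == "__main__":
    # Example: a batch of 64-channel 7x7 feature maps, as a Nature-CNN-style
    # Atari encoder might produce.
    features = torch.randn(4, 64, 7, 7)
    print(CrossAttendingBlock(64)(features).shape)     # torch.Size([4, 64, 7, 7])
```

A sketch of how such a block could be wired into a Stable Baselines3 PPO agent follows the record details at the end of this page; both sketches are illustrative guesses at the approach, not the project's released code.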
Main Author: | Ong, Dorvin Poh Jie |
---|---|
Other Authors: | Lee Bu Sung, Francis |
Format: | Final Year Project |
Language: | English |
Published: | Nanyang Technological University, 2021 |
Subjects: | Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence |
Online Access: | https://hdl.handle.net/10356/150563 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-150563 |
---|---|
record_format |
dspace |
school |
School of Computer Science and Engineering |
contact |
EBSLEE@ntu.edu.sg |
degree |
Bachelor of Engineering (Computer Engineering) |
citation |
Ong, D. P. J. (2021). Improving sample efficiency using attention in deep reinforcement learning. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/150563 |
project code |
SCSE20-0527 |
file format |
application/pdf |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence |
description |
Reinforcement learning is becoming increasingly popular due to its notable feats in mainstream games such as DOTA 2 and Go, as well as its applicability to many fields. It has displayed potential to exceed human-level performance in complicated environments and sequential decision-making problems. However, one limitation that has plagued reinforcement learning is its poor sample efficiency: of the three paradigms of machine learning, reinforcement learning requires the most samples to produce a useful result, and the more samples needed, the more energy and time must be spent to train a useful model, which is expensive. In this report, we conducted a rigorous study of the reinforcement learning field, implemented Proximal Policy Optimization (PPO), and attempted to improve the sample efficiency of reinforcement learning algorithms using self-attention models. Borrowing ideas from previous implementations of self-attention models, we experimented with variants of the Self-Attending Network (SAN), namely the Channel-wise Self-Attending Network (C-SAN) and the Cross-Attending Network (CAN), the latter combining channel-column-wise and channel-row-wise attention. Our results showed that CAN was distinctly more sample efficient than the original SAN and the vanilla PPO (No Attention) model in the game of Pong. However, shifting the implementation to Stable Baselines3 returned results that differ from our earlier findings; we attribute this discrepancy to implementation differences in the PPO algorithm. In the next experiment, we tested SAN, C-SAN and CAN on 49 Atari 2600 games. C-SAN was found to be better than the No Attention model by 15.36% on average, while CAN and SAN were found to be worse by 14.44% and 1.47% respectively. Based on these results, we hypothesize that self-attention models could perform better in complex environments, because a better state representation could facilitate learning a better policy. Further re-evaluation on more complex environments over a longer training duration showed potential in CAN, which managed to outperform the other models. However, a preliminary investigation into why self-attention works was inconclusive. Nevertheless, we offer some hypotheses to explain the effect of self-attention models. |
author2 |
Lee Bu Sung, Francis |
format |
Final Year Project |
author |
Ong, Dorvin Poh Jie |
title |
Improving sample efficiency using attention in deep reinforcement learning |
publisher |
Nanyang Technological University |
publishDate |
2021 |
url |
https://hdl.handle.net/10356/150563 |
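For context on the Stable Baselines3 experiments mentioned in the description, the following is a hypothetical sketch (not the project's actual code) of how an attention block could sit between a PPO agent's CNN encoder and its policy/value heads using Stable Baselines3's custom features-extractor mechanism. The attention module here is a stand-in (`nn.Identity`), to be replaced by something like the cross-attending sketch above; all names and hyperparameters are assumptions.

```python
# Hypothetical integration sketch: a custom Stable Baselines3 features extractor
# with a slot for an attention block after the convolutional encoder.
import torch as th
import torch.nn as nn
from stable_baselines3 import PPO
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class AttentionCNN(BaseFeaturesExtractor):
    """Nature-CNN-style encoder followed by a placeholder attention block."""

    def __init__(self, observation_space, features_dim: int = 512):
        super().__init__(observation_space, features_dim)
        n_channels = observation_space.shape[0]          # channel-first image input
        self.cnn = nn.Sequential(
            nn.Conv2d(n_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # Stand-in for an attention module (e.g. the cross-attending sketch above).
        self.attention = nn.Identity()
        self.flatten = nn.Flatten()
        # Infer the flattened feature size with a dummy forward pass.
        with th.no_grad():
            sample = th.as_tensor(observation_space.sample()[None]).float()
            n_flat = self.flatten(self.attention(self.cnn(sample))).shape[1]
        self.linear = nn.Sequential(nn.Linear(n_flat, features_dim), nn.ReLU())

    def forward(self, observations: th.Tensor) -> th.Tensor:
        return self.linear(self.flatten(self.attention(self.cnn(observations))))


# Example usage (assumes `env` is an Atari-preprocessed, channel-first image env):
# model = PPO(
#     "CnnPolicy", env,
#     policy_kwargs=dict(
#         features_extractor_class=AttentionCNN,
#         features_extractor_kwargs=dict(features_dim=512),
#     ),
# )
# model.learn(total_timesteps=1_000_000)
```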