Improving sample efficiency using attention in deep reinforcement learning

Bibliographic Details
Main Author: Ong, Dorvin Poh Jie
Other Authors: Lee Bu Sung, Francis
Format: Final Year Project
Language:English
Published: Nanyang Technological University 2021
Subjects:
Online Access:https://hdl.handle.net/10356/150563
Institution: Nanyang Technological University
Description
Summary: Reinforcement learning is becoming increasingly popular due to its cumulative feats in mainstream games such as DOTA2 and Go, as well as its applicability to many fields. It has displayed the potential to exceed human levels of performance in complicated environments and sequential decision-making problems. However, one limitation that has plagued reinforcement learning is its poor sample efficiency. Among the three paradigms of machine learning, reinforcement learning requires the most samples to produce a useful result; the more samples needed, the more energy and time is required to train a useful model, which is expensive. In this report, we conducted a rigorous study of the reinforcement learning field, implemented the Proximal Policy Optimization (PPO) algorithm and attempted to improve the sample efficiency of reinforcement learning algorithms using self-attention models. Borrowing ideas from previous implementations of self-attention models, we experimented on variants of the Self-Attending Network (SAN), namely the Channel-wise Self-Attending Network (C-SAN) and the Cross Attending Network (CAN), which combines channel-column-wise and channel-row-wise attention. Our results showed that CAN was distinctly more sample efficient than the original SAN and the vanilla PPO (No Attention) model in the game of Pong. However, shifting the implementation to Stable Baselines3 returned results that differ from our earlier findings; we attribute this discrepancy to implementation differences in the PPO algorithm. In the next experiment, we tested SAN, C-SAN and CAN on 49 Atari 2600 games. C-SAN was found to be better than the No Attention model by 15.36% on average, while CAN and SAN were found to be worse by 14.44% and 1.47% respectively. Based on these results, we hypothesize that self-attention models could perform better in complex environments, where the benefits of a better state representation could facilitate learning a better policy. Further re-evaluation on more complex environments with a longer training duration showed potential in CAN, which managed to outperform the other models. However, a preliminary investigation into why self-attention works was inconclusive. Nevertheless, we provide some hypotheses to explain the effect of self-attention models.
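
To make the approach concrete, the following is a minimal PyTorch sketch of how a self-attention block can be inserted between a convolutional feature extractor and the PPO policy and value heads. The abstract does not specify the exact SAN, C-SAN or CAN architectures, so the module below follows the common non-local (SAGAN-style) self-attention formulation as an illustrative stand-in; the class name, reduction factor and residual gating are assumptions, not the report's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSelfAttention(nn.Module):
    # Illustrative self-attention over the spatial positions of a CNN feature
    # map. This is a generic stand-in, not the SAN / C-SAN / CAN architecture
    # from the report, whose details are not given in the abstract.
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        # Learnable gate so the block starts as an identity mapping.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)           # (B, HW, C/r)
        k = self.key(x).flatten(2)                              # (B, C/r, HW)
        v = self.value(x).flatten(2)                            # (B, C, HW)
        attn = F.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (B, HW, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)       # attention-weighted features
        return self.gamma * out + x                             # residual connection

In a Stable Baselines3 setup, such a block could be wrapped inside a custom features extractor and passed to PPO through policy_kwargs, so the attention model only changes the state representation while the PPO update itself is left untouched.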