Investigating sim-to-real transfer for reinforcement learning-based robotic manipulation

Bibliographic Details
Main Author: Cheng, Jason Kuan Yong
Other Authors: Soong Boon Hee
Format: Final Year Project
Language: English
Published: Nanyang Technological University 2021
Online Access: https://hdl.handle.net/10356/148803
Institution: Nanyang Technological University
Description
Summary: In this project, model-free Deep Reinforcement Learning (DRL) algorithms were implemented to solve complex robotic environments, covering both low-dimensional and high-dimensional robotic tasks. Low-dimensional tasks have state inputs that are discrete values such as robotic arm joint angles, position, and velocity. High-dimensional tasks have state inputs that are images, with camera views of the environment from various angles. The low-dimensional robotic environments are the CartPole Continuous, Hopper, Half-Cheetah, and Ant Bullet environments, with PyBullet (Coumans and Bai, 2016–2019) as the back-end physics engine for the robotic simulator. The high-dimensional robotic manipulation tasks are Open Box, Close Box, Pick-up Cup, and Scoop with Spatula, from the RLBench (James et al., 2020) task implementations. In the experiments, off-policy algorithms such as Deep Deterministic Policy Gradients (DDPG) and Twin-Delayed Deep Deterministic Policy Gradients (TD3) outperformed the other algorithms on the low-dimensional tasks because they learn from experience replay, which gives them superior sample efficiency compared to on-policy algorithms such as Trust Region Policy Optimisation (TRPO) and Proximal Policy Optimisation (PPO). For the high-dimensional environments, only the Option-Critic algorithm was able to solve some of the tasks, namely Open Box and Close Box. Off-policy algorithms did not perform well because of the high memory cost of holding images in experience replay, which prevented the agent from learning well from replayed experience. On-policy algorithms also did not learn well in the high-dimensional environments, as they were unable to generalise from the sparse reward signals. None of the implemented algorithms was able to solve the more complex manipulation tasks, Scoop with Spatula and Pick-up Cup, because the agent could not complete the task through random exploration and so never obtained the sparse reward signal to learn from.
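
The memory pressure that image observations put on experience replay can be illustrated with a rough back-of-the-envelope calculation. The sketch below is not taken from the project: the buffer capacity, the 28-dimensional state vector, the 128x128 RGB image size, and the 8-dimensional action are all assumed values chosen only for illustration, and `replay_buffer_bytes` is a hypothetical helper.

```python
def replay_buffer_bytes(capacity, obs_shape, bytes_per_element, action_dim=0):
    """Approximate bytes needed to store `capacity` transitions.

    Each transition is assumed to hold an observation, a next observation,
    a float32 action vector, a float32 reward and a 1-byte done flag.
    """
    obs_elements = 1
    for dim in obs_shape:
        obs_elements *= dim
    obs_bytes = obs_elements * bytes_per_element
    per_transition = 2 * obs_bytes + 4 * action_dim + 4 + 1
    return capacity * per_transition


CAPACITY = 1_000_000  # assumed buffer size, not a value from the project

# Assumed low-dimensional state: 28 float32 values (4 bytes each).
low_dim = replay_buffer_bytes(CAPACITY, (28,), 4, action_dim=8)

# Assumed image state: one 128x128 RGB frame stored as uint8 (1 byte each).
image = replay_buffer_bytes(CAPACITY, (128, 128, 3), 1, action_dim=8)

print(f"state-vector buffer: {low_dim / 1e9:.2f} GB")
print(f"image buffer:        {image / 1e9:.2f} GB")
```

Under these assumptions the state-vector buffer needs about 0.26 GB while the image buffer needs roughly 98 GB, which suggests why a replay buffer holding images may have to be kept much smaller than one holding joint-state vectors.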