Investigation and simulation of transfer reinforcement learning-based for robotic manipulation
Main Author:
Other Authors:
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University, 2022
Subjects:
Online Access: https://hdl.handle.net/10356/155421
Institution: Nanyang Technological University
Summary: Reinforcement learning studies the interaction between an agent and its environment, in which the agent makes sequential decisions, optimizes its policy and maximizes cumulative return. It has great research value and application potential, and it is a key step toward general artificial intelligence. This project introduces the principles and methods of reinforcement learning. DRL algorithms based on the Actor-Critic framework and an HRL algorithm based on the Option-Critic framework are verified and compared on complex robot tasks in the MuJoCo and RLBench robot simulation environments. The tasks that use MuJoCo as the back-end physics engine of the robot simulator are mainly low-dimensional tasks with discrete inputs, including Humanoid, Hopper, HalfCheetah and Ant. In the RLBench robot simulation environment, the tasks are mainly high-dimensional tasks whose inputs are images, including Open Box, Close Box and Pick Up Cup.
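As a rough illustration of the low-dimensional MuJoCo tasks named above, the sketch below prints the observation and action dimensions of each task. It assumes the OpenAI Gym MuJoCo bindings are installed; the `-v3` version suffixes are an assumption, not something taken from the thesis.

```python
import gym  # assumes the OpenAI Gym MuJoCo environments are installed

# The four low-dimensional locomotion tasks mentioned in the summary.
# Each exposes a compact state vector and a continuous action vector,
# in contrast to the image observations used by the RLBench tasks.
for name in ["Humanoid-v3", "Hopper-v3", "HalfCheetah-v3", "Ant-v3"]:
    env = gym.make(name)
    obs_dim = env.observation_space.shape[0]   # size of the state vector
    act_dim = env.action_space.shape[0]        # size of the action vector
    print(f"{name}: obs_dim={obs_dim}, act_dim={act_dim}")
    env.close()
```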
In the low-dimensional robotic tasks, the on-policy algorithm is far less data-efficient than the off-policy algorithms, which learn from experience replay. Among the three off-policy algorithms, DDPG is far less effective than TD3 and SAC. Because its deterministic policy lacks exploration ability, TD3 shows larger training variance than the stochastic-policy algorithm SAC. In terms of the convergence speed of the reward, SAC performs best.
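The data-efficiency gap described here comes from off-policy methods reusing past transitions. Below is a minimal sketch of such a replay buffer; the capacity and batch size are illustrative assumptions, not values from the thesis.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Minimal experience replay buffer, as used by off-policy methods
    such as DDPG, TD3 and SAC (illustrative sketch, not the thesis code)."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=256):
        # Uniformly re-sample past transitions, so each one can be
        # reused many times for gradient updates.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones
```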
For the high-dimensional robotic tasks, only the Option-Critic algorithm can solve the Open Box and Close Box tasks. Because of memory limits, the off-policy algorithms cannot be implemented well when images are retained in the experience replay buffer, so the agent cannot learn well from replayed experience. And because random exploration rarely yields the sparse reward signal needed to solve the task, no algorithm can solve more complex manipulation tasks such as Pick Up Cup.
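The memory pressure mentioned here is easy to see with a back-of-the-envelope calculation; the sketch below uses an assumed image resolution and buffer size purely for illustration, not figures from the thesis.

```python
# Rough estimate of why image observations strain a replay buffer.
# The image resolution and buffer capacity below are illustrative assumptions.
height, width, channels = 128, 128, 3          # one RGB camera frame
bytes_per_obs = height * width * channels      # uint8 pixels -> 1 byte each
transitions = 1_000_000                        # typical replay capacity
# Each stored transition keeps both the current and the next observation.
total_bytes = 2 * bytes_per_obs * transitions
print(f"{total_bytes / 1024**3:.1f} GiB")      # ~91.6 GiB for these numbers
```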