Reducing estimation bias via triplet-average deep deterministic policy gradient

The overestimation caused by function approximation is a well-known property of Q-learning algorithms, especially in single-critic models, and it leads to poor performance in practical tasks. However, the opposite property, underestimation, which often occurs in Q-learning methods with double critics, has been largely left untouched. In this article, we investigate the underestimation phenomenon in the recent twin delayed deep deterministic (TD3) actor-critic algorithm and theoretically demonstrate its existence. We also observe that this underestimation bias does indeed hurt performance in various experiments. Considering the opposite properties of single-critic and double-critic methods, we propose a novel triplet-average deep deterministic policy gradient algorithm that takes the weighted action value of three target critics to reduce the estimation bias. Given the connection between estimation bias and approximation error, we further suggest averaging previous target values to reduce per-update error and improve performance. Extensive empirical results over various continuous control tasks in OpenAI Gym show that our approach outperforms state-of-the-art methods. Source code is available at https://github.com/shenjianbing/TADDRL.

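The abstract points to two mechanisms: a target value formed as a weighted combination of three target critics, and an average over previous target values to reduce per-update error. The snippet below is a minimal, hypothetical sketch of how such a target could be computed; the function name, the weight beta, and the averaging window are illustrative assumptions drawn only from the abstract, not the paper's actual implementation (see the linked source code for that).

```python
import numpy as np
from collections import deque

def triplet_average_target(reward, gamma, q1, q2, q3, beta=0.7,
                           prev_targets=None, window=5):
    """Hypothetical sketch of a triplet-average TD target.

    q1, q2: values from two target critics; their minimum is the clipped
            double-Q estimate (underestimation-prone), as in TD3.
    q3:     a third target critic, used single-critic style
            (overestimation-prone).
    beta:   illustrative weight trading off the two opposing biases.
    prev_targets: optional buffer of earlier targets, averaged in to
            reduce per-update error, as suggested in the abstract.
    """
    # Weighted action value of the three target critics.
    q_triplet = beta * min(q1, q2) + (1.0 - beta) * q3
    target = reward + gamma * q_triplet

    # Optionally average with previous target values to smooth the update.
    if prev_targets:
        recent = list(prev_targets)[-window:]
        target = float(np.mean(recent + [target]))
    return target

# Example usage with made-up numbers:
history = deque(maxlen=5)
y = triplet_average_target(reward=1.0, gamma=0.99,
                           q1=10.2, q2=9.8, q3=11.0,
                           prev_targets=history)
history.append(y)
```

In this sketch, min(q1, q2) plays the underestimation-prone double-critic role while q3 plays the overestimation-prone single-critic role, so beta trades off the two biases; the exact weighting and averaging scheme used in the paper may differ.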

Bibliographic Details
Main Authors: WU, Dongming; DONG, Xingping; SHEN, Jianbing; HOI, Steven C. H.
Format: text (application/pdf)
Language: English
Published: Institutional Knowledge at Singapore Management University, 2020
Subjects: Averaging technology; deep reinforcement learning (DRL); estimation bias; triplet networks; Numerical Analysis and Scientific Computing; Software Engineering; Theory and Algorithms
DOI: 10.1109/TNNLS.2019.2959129
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Collection: Research Collection School Of Computing and Information Systems
Online Access: https://ink.library.smu.edu.sg/sis_research/5920
https://ink.library.smu.edu.sg/context/sis_research/article/6923/viewcontent/tnnls19ReducingBias_av.pdf
Institution: Singapore Management University