Sample-efficient iterative lower bound optimization of deep reactive policies for planning in continuous MDPs

Recent advances in deep learning have enabled optimization of deep reactive policies (DRPs) for continuous MDP planning by encoding a parametric policy as a deep neural network and exploiting automatic differentiation in an end-to-end model-based gradient descent framework. This approach has proven effective for optimizing DRPs in nonlinear continuous MDPs, but it requires a large number of sampled trajectories to learn effectively and can suffer from high variance in solution quality. In this work, we revisit the overall model-based DRP objective and instead take a minorization-maximization perspective to iteratively optimize the DRP w.r.t. a locally tight lower-bounded objective. This novel formulation of DRP learning as iterative lower bound optimization (ILBO) is particularly appealing because (i) each step is structurally easier to optimize than the overall objective, (ii) it guarantees a monotonically improving objective under certain theoretical conditions, and (iii) it reuses samples between iterations, thus lowering sample complexity. Empirical evaluation confirms that ILBO is significantly more sample-efficient than the state-of-the-art DRP planner and consistently produces better solution quality with lower variance. We additionally demonstrate that ILBO generalizes well to new problem instances (i.e., different initial states) without requiring retraining.
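The abstract describes the baseline approach that ILBO improves upon: a deep reactive policy trained by end-to-end model-based gradient descent through a differentiable transition and reward model. The sketch below is a minimal, hypothetical illustration of that baseline idea on a toy one-dimensional MDP; the dynamics, reward, network size, and hyperparameters are all assumptions made for illustration, and it does not implement the paper's ILBO surrogate, which instead repeatedly maximizes a locally tight lower bound on this objective and reuses sampled trajectories between iterations.

    # Minimal sketch (not the authors' code): end-to-end model-based gradient
    # optimization of a deep reactive policy (DRP) on a toy 1-D continuous MDP
    # with a known, differentiable transition and reward model.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Illustrative known model: s' = s + a + noise, reward = -(s'^2) - 0.01 * a^2
    def transition(s, a):
        return s + a + 0.1 * torch.randn_like(s)

    def reward(s_next, a):
        return -(s_next ** 2) - 0.01 * a ** 2

    # Deep reactive policy: maps the current state directly to an action.
    policy = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
    opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

    horizon, batch = 20, 64
    for step in range(200):
        s = torch.randn(batch, 1)        # sampled initial states
        total = torch.zeros(batch, 1)
        for t in range(horizon):
            a = policy(s)
            s = transition(s, a)         # gradient flows through the model
            total = total + reward(s, a)
        loss = -total.mean()             # maximize expected cumulative reward
        opt.zero_grad()
        loss.backward()
        opt.step()
        # ILBO (per the abstract) would replace this direct objective with a
        # locally tight lower bound optimized iteratively, reusing trajectories
        # across iterations to reduce sample complexity.

Note that each gradient step above requires freshly sampled trajectories, which is precisely the sample-inefficiency and variance issue the abstract attributes to the baseline DRP planner.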


Bibliographic Details
Main Authors: LOW, Siow Meng, KUMAR, Akshat, SANNER, Scott
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2022
Subjects: Artificial Intelligence and Robotics
Online Access:https://ink.library.smu.edu.sg/sis_research/7724
https://ink.library.smu.edu.sg/context/sis_research/article/8727/viewcontent/21220_Article_Text_25233_1_2_20220628.pdf
Institution: Singapore Management University
Collection: Research Collection School Of Computing and Information Systems (InK@SMU)
License: http://creativecommons.org/licenses/by-nc-nd/4.0/