Time-inconsistency in reinforcement learning: an equilibrium control paradigm


Bibliographic Details
Main Author: Lesmana, Nixie Sapphira
Other Authors: Patrick Pun Chi Seng
Format: Thesis (Doctor of Philosophy)
Language: English
Published: Nanyang Technological University, 2024
Subjects: Science::Mathematics::Applied mathematics; Engineering::Computer science and engineering::Computing methodologies
Online Access: https://hdl.handle.net/10356/173187
Institution: Nanyang Technological University

Description
Time inconsistency (TIC) describes a situation in which a plan of current and future actions that is optimal today is no longer optimal in the future. In reinforcement learning (RL), TIC often arises when we encode realistic human preferences or specific behaviors into an agent's performance criterion. Such encoding has broad applications in risk-sensitive and human-centric domains such as finance, economics, and assistive robotics. Despite its importance, TIC in RL is difficult to handle because many amenable properties of the globally optimal policy under the standard performance criterion fail to extend. One of the most important challenges is the non-applicability of Bellman's Principle of Optimality (BPO), from which many popular RL methods are derived. In recent years, subgame perfect equilibrium (SPE) control has emerged as an important resolution to TIC in stochastic control, promising both tractable computation (through BPO recovery) and desirable control performance. As explained in behavioral economics, SPE corresponds to the behavior of a sophisticated agent who handles TIC by treating future deviations as a constraint on the construction of the current plan, so that the resulting plan becomes time-consistent (TC). This thesis introduces SPE as a novel control objective and search target in TIC RL. We formalize the search problem as subgame perfect equilibrium reinforcement learning (SPERL) and develop novel SPERL methods for various TIC RL criteria in both finite-horizon and infinite-horizon settings.

In the finite-horizon setting, we consider common TIC criteria from stochastic control: non-exponential discounting, mean-variance, and state-dependent rewards. We adapt the extended dynamic programming (DP) theory from TIC stochastic control into RL through policy iteration and develop a new convergence analysis for SPERL. Our results address the two main bottlenecks in applying standard RL methods to TIC criteria: the non-existence of a recursive temporal-difference (TD) formula and the non-monotonicity of updates. We then extend the SPERL formalism to infinite-horizon settings and use it to address open questions regarding the "optimality" and "convergence" of standard policy iteration under TIC. Drawing on these results, we develop novel policy iteration and sample-based methods for finding SPE under the infinite-horizon, non-exponentially discounted criterion. Our experimental results highlight the importance of SPERL design choices, such as TD formulas and backward updates, for SPE learning performance, and show that our method outperforms several alternatives.

Finally, we extend the scope of SPERL criteria by considering cumulative prospect theory (CPT). We develop novel CPT-SPERL methods that depart from both the extended DP theory and policy iteration, building on recent progress in distributional RL. To support these methods, we develop new theory on CPT predictions and SPE characterizations, which in turn contributes to the still-nascent theory of risk-sensitive distributional RL. Our experimental results demonstrate the efficacy of our methods in SPE learning. Moreover, by studying different classes of methods and notions of optimality in the CPT context, we obtain new evidence for the desirability of SPERL and of SPE as a controller.
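
As a concrete companion to the description above, the following is a minimal, self-contained Python sketch, added purely for illustration and not taken from the thesis. It sets up a hypothetical four-period "when to act" problem under quasi-hyperbolic (beta-delta) discounting and contrasts three behaviors: the plan that is optimal at t = 0, a naive agent that re-optimizes at every period and ends up deviating from its own earlier plan (time inconsistency), and a sophisticated agent whose policy is the subgame perfect equilibrium computed by backward induction, with each time-t self best-responding to its future selves. All names, rewards, and parameters (R, BETA, DELTA, the "todo"/"done" states) are illustrative assumptions, not quantities from the thesis.

# Illustrative toy example (not from the thesis): time inconsistency under
# quasi-hyperbolic (beta-delta) discounting, and a subgame perfect
# equilibrium (SPE) policy computed by backward induction.
#
# A task can be done in at most one of T = 4 periods; doing it at period t
# yields reward R[t]. The time-t self weighs an immediate reward fully and
# every later reward by BETA, which makes optimal plans time-inconsistent.

from itertools import product

T = 4                        # decision periods t = 0, 1, 2, 3
R = [3.0, 5.0, 8.0, 13.0]    # reward for acting at period t (illustrative)
BETA, DELTA = 0.5, 1.0       # quasi-hyperbolic discounting weights

def weight(k):
    """Weight the time-t self places on a reward received k steps ahead."""
    return 1.0 if k == 0 else BETA * DELTA ** k

def step(state, t, action):
    """Deterministic dynamics: acting while 'todo' yields R[t] and ends the task."""
    if state == "todo" and action == "act":
        return "done", R[t]
    return state, 0.0

def value(t0, state, actions):
    """Time-t0 criterion of an action sequence executed from `state`."""
    total, s = 0.0, state
    for k, a in enumerate(actions):
        s, r = step(s, t0 + k, a)
        total += weight(k) * r
    return total

def precommitted_plan(t0, state):
    """Action sequence maximizing the time-t0 criterion (brute force)."""
    return max(product(("act", "wait"), repeat=T - t0),
               key=lambda seq: value(t0, state, seq))

def naive_trajectory():
    """Re-optimize at every period, but execute only the first planned action."""
    s, taken = "todo", []
    for t in range(T):
        a = precommitted_plan(t, s)[0]
        taken.append(a)
        s, _ = step(s, t, a)
    return taken

def spe_policy():
    """Backward induction: each time-t self best-responds to its future selves."""
    pi = [dict() for _ in range(T)]
    for t in reversed(range(T)):
        for state in ("todo", "done"):
            def rollout_value(a, state=state, t=t):
                # Take action `a` now, then follow the already-computed future policies.
                seq, s = [a], step(state, t, a)[0]
                for u in range(t + 1, T):
                    seq.append(pi[u][s])
                    s = step(s, u, pi[u][s])[0]
                return value(t, state, seq)
            pi[t][state] = max(("act", "wait"), key=rollout_value)
    return pi

def acts_at(actions):
    """First period at which a plan or trajectory acts (None if never)."""
    return next((k for k, a in enumerate(actions) if a == "act"), None)

if __name__ == "__main__":
    pi = spe_policy()
    spe_run, s = [], "todo"
    for t in range(T):
        spe_run.append(pi[t][s])
        s, _ = step(s, t, pi[t][s])
    print("plan made at t=0 acts at period:", acts_at(precommitted_plan(0, "todo")))  # 3
    print("naive (re-planning) acts at:    ", acts_at(naive_trajectory()))            # 2
    print("SPE (sophisticated) acts at:    ", acts_at(spe_run))                       # 0

Run as written, the t = 0 optimal plan postpones acting until the last period, the naive agent abandons that plan and acts one period earlier than it originally intended, and the equilibrium (sophisticated) agent acts immediately; the naive agent's deviation from its own plan is precisely the time inconsistency that SPE-based control is meant to eliminate.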

Record Details
School: School of Physical and Mathematical Sciences
Contact: cspun@ntu.edu.sg
Date Issued: 2023
Citation: Lesmana, N. S. (2023). Time-inconsistency in reinforcement learning: an equilibrium control paradigm. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/173187
DOI: 10.32657/10356/173187
Funding: Nanyang President's Graduate Scholarship (NPGS)
License: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).