Time-inconsistency in reinforcement learning: an equilibrium control paradigm


Bibliographic Details
Main Author: Lesmana, Nixie Sapphira
Other Authors: Patrick Pun Chi Seng
Format: Thesis (Doctor of Philosophy)
Language: English
Published: Nanyang Technological University, 2024
Subjects: Science::Mathematics::Applied mathematics; Engineering::Computer science and engineering::Computing methodologies
Online Access: https://hdl.handle.net/10356/173187
Institution: Nanyang Technological University

Description
Time inconsistency (TIC) describes a situation in which a plan of current and future actions that is optimal today is no longer optimal in the future. In reinforcement learning (RL), TIC often arises when we encode realistic human preferences or specific behaviors into an agent's performance criterion. Such encoding has broad applications in risk-sensitive and human-centric domains such as finance, economics, and assistive robotics. Despite its importance, TIC in RL is difficult to handle because many amenable properties of the globally optimal policy under the standard performance criterion fail to extend. One of the most important challenges is the non-applicability of Bellman's Principle of Optimality (BPO), from which many popular RL methods are derived. In recent years, subgame perfect equilibrium (SPE) control has emerged as an important resolution to TIC in stochastic control, promising both tractable computation (through BPO recovery) and desirable control performance. As explained in behavioral economics, SPE corresponds to the behavior of a sophisticated agent who handles TIC by treating future deviations as a constraint on the construction of the current plan, so that the resulting plan becomes time-consistent (TC). This thesis introduces SPE as a novel control objective and search target in TIC RL. We formalize the search problem as subgame perfect equilibrium reinforcement learning (SPERL) and develop novel SPERL methods for various TIC RL criteria in both finite-horizon and infinite-horizon settings.

In the finite-horizon setting, we consider common TIC criteria from stochastic control: non-exponential discounting, mean-variance, and state-dependent rewards. We adapt the extended dynamic programming (DP) theory from TIC stochastic control into RL through policy iteration and develop a new convergence analysis for SPERL. Our results address the two main bottlenecks in applying standard RL methods to TIC criteria: the non-existence of a recursive temporal-difference (TD) formula and the non-monotonicity of updates. We then extend the SPERL formalism to infinite-horizon settings and use it to address open questions regarding the "optimality" and "convergence" of standard policy iteration under TIC. Drawing on these results, we develop novel policy iteration and sample-based methods for finding SPE under the infinite-horizon, non-exponentially discounted criterion. Our experimental results highlight the importance of SPERL design choices, such as TD formulas and backward updates, for SPE learning performance, and show that our method outperforms several alternatives.

Finally, we extend the scope of SPERL criteria by considering cumulative prospect theory (CPT). We develop novel CPT-SPERL methods that depart from both the extended DP theory and policy iteration, building on recent progress in distributional RL. To support these methods, we develop new theory on CPT predictions and SPE characterizations, which in turn contributes to the still-nascent theory of risk-sensitive distributional RL. Our experimental results demonstrate the efficacy of our methods in SPE learning. Moreover, by studying different classes of methods and notions of optimality in the CPT context, we obtain new evidence for the desirability of SPERL and of SPE as a controller.
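
As a concrete companion to the description above, the following is a minimal, self-contained Python sketch, added purely for illustration and not taken from the thesis. It sets up a hypothetical four-period "when to act" problem under quasi-hyperbolic (beta-delta) discounting and contrasts three behaviors: the plan that is optimal at t = 0, a naive agent that re-optimizes at every period and ends up deviating from its own earlier plan (time inconsistency), and a sophisticated agent whose policy is the subgame perfect equilibrium computed by backward induction, with each time-t self best-responding to its future selves. All names, rewards, and parameters (R, BETA, DELTA, the "todo"/"done" states) are illustrative assumptions, not quantities from the thesis.

# Illustrative toy example (not from the thesis): time inconsistency under
# quasi-hyperbolic (beta-delta) discounting, and a subgame perfect
# equilibrium (SPE) policy computed by backward induction.
#
# A task can be done in at most one of T = 4 periods; doing it at period t
# yields reward R[t]. The time-t self weighs an immediate reward fully and
# every later reward by BETA, which makes optimal plans time-inconsistent.

from itertools import product

T = 4                        # decision periods t = 0, 1, 2, 3
R = [3.0, 5.0, 8.0, 13.0]    # reward for acting at period t (illustrative)
BETA, DELTA = 0.5, 1.0       # quasi-hyperbolic discounting weights

def weight(k):
    """Weight the time-t self places on a reward received k steps ahead."""
    return 1.0 if k == 0 else BETA * DELTA ** k

def step(state, t, action):
    """Deterministic dynamics: acting while 'todo' yields R[t] and ends the task."""
    if state == "todo" and action == "act":
        return "done", R[t]
    return state, 0.0

def value(t0, state, actions):
    """Time-t0 criterion of an action sequence executed from `state`."""
    total, s = 0.0, state
    for k, a in enumerate(actions):
        s, r = step(s, t0 + k, a)
        total += weight(k) * r
    return total

def precommitted_plan(t0, state):
    """Action sequence maximizing the time-t0 criterion (brute force)."""
    return max(product(("act", "wait"), repeat=T - t0),
               key=lambda seq: value(t0, state, seq))

def naive_trajectory():
    """Re-optimize at every period, but execute only the first planned action."""
    s, taken = "todo", []
    for t in range(T):
        a = precommitted_plan(t, s)[0]
        taken.append(a)
        s, _ = step(s, t, a)
    return taken

def spe_policy():
    """Backward induction: each time-t self best-responds to its future selves."""
    pi = [dict() for _ in range(T)]
    for t in reversed(range(T)):
        for state in ("todo", "done"):
            def rollout_value(a, state=state, t=t):
                # Take action `a` now, then follow the already-computed future policies.
                seq, s = [a], step(state, t, a)[0]
                for u in range(t + 1, T):
                    seq.append(pi[u][s])
                    s = step(s, u, pi[u][s])[0]
                return value(t, state, seq)
            pi[t][state] = max(("act", "wait"), key=rollout_value)
    return pi

def acts_at(actions):
    """First period at which a plan or trajectory acts (None if never)."""
    return next((k for k, a in enumerate(actions) if a == "act"), None)

if __name__ == "__main__":
    pi = spe_policy()
    spe_run, s = [], "todo"
    for t in range(T):
        spe_run.append(pi[t][s])
        s, _ = step(s, t, pi[t][s])
    print("plan made at t=0 acts at period:", acts_at(precommitted_plan(0, "todo")))  # 3
    print("naive (re-planning) acts at:    ", acts_at(naive_trajectory()))            # 2
    print("SPE (sophisticated) acts at:    ", acts_at(spe_run))                       # 0

Run as written, the t = 0 optimal plan postpones acting until the last period, the naive agent abandons that plan and acts one period earlier than it originally intended, and the equilibrium (sophisticated) agent acts immediately; the naive agent's deviation from its own plan is precisely the time inconsistency that SPE-based control is meant to eliminate.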

Record Details
School: School of Physical and Mathematical Sciences
Contact: cspun@ntu.edu.sg
Date Issued: 2023
Citation: Lesmana, N. S. (2023). Time-inconsistency in reinforcement learning: an equilibrium control paradigm. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/173187
DOI: 10.32657/10356/173187
Funding: Nanyang President's Graduate Scholarship (NPGS)
License: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).