Time-inconsistency in reinforcement learning: an equilibrium control paradigm
Time inconsistency (TIC) describes a situation in which a plan, consisting of current and future actions, that is optimal today may no longer be optimal in the future. In reinforcement learning (RL), TIC often arises as we encode realistic human preferences or specific behaviors into an agent's...
Main Author: | Lesmana, Nixie Sapphira |
---|---|
Other Authors: | Patrick Pun Chi Seng |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2024 |
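Subjects: | Science::Mathematics::Applied mathematics; Engineering::Computer science and engineering::Computing methodologies |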
Online Access: | https://hdl.handle.net/10356/173187 |
Institution: | Nanyang Technological University |
id |
sg-ntu-dr.10356-173187 |
---|---|
record_format |
dspace |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Science::Mathematics::Applied mathematics Engineering::Computer science and engineering::Computing methodologies |
spellingShingle |
Science::Mathematics::Applied mathematics Engineering::Computer science and engineering::Computing methodologies Lesmana, Nixie Sapphira Time-inconsistency in reinforcement learning: an equilibrium control paradigm |
description |
Time inconsistency (TIC) describes a situation in which a plan, consisting of current and future actions, that is optimal today may no longer be optimal in the future. In reinforcement learning (RL), TIC often arises when we encode realistic human preferences or specific behaviors into an agent's performance criterion. Such encoding has broad applications in risk-sensitive and human-centric domains such as finance, economics, and assistive robotics. Despite its importance, TIC in RL is difficult to handle because many convenient properties of the globally optimal policy under the standard performance criterion fail to extend. One of the most important challenges is the non-applicability of Bellman's Principle of Optimality (BPO), on which many popular RL methods are built. In recent years, subgame perfect equilibrium (SPE) control has emerged as an important resolution to TIC in stochastic control, promising both tractable computation (through BPO recovery) and desirable control performance. SPE, as explained in behavioral economics, corresponds to the behavior of a sophisticated human agent who handles TIC by treating future deviations as a constraint when constructing the current plan, so that the resulting plan becomes time-consistent (TC).
In this thesis, SPE is introduced as a novel control objective and search target in TIC RL. We formalize the search problem as subgame perfect equilibrium reinforcement learning (SPERL) and develop novel SPERL methods for various TIC RL criteria in both finite-horizon and infinite-horizon settings. In the finite-horizon setting, we consider common TIC criteria in stochastic control problems: non-exponential discounting, mean-variance, and state-dependent rewards. We adapt the extended dynamic programming (DP) theory from TIC stochastic control to RL through policy iteration and develop a new convergence analysis for SPERL. Our results address the two main bottlenecks of applying standard RL methods to TIC criteria: the non-existence of a recursive temporal-difference (TD) formula and the non-monotonicity of updates. We then extend the SPERL formalism to infinite-horizon settings and use it to address some open questions regarding the "optimality" and "convergence" of standard policy iteration under TIC. Drawing on these results, we develop novel policy iteration and sample-based methods for finding an SPE under the infinite-horizon non-exponentially discounted criterion. Our experimental results highlight the importance of SPERL design choices, such as TD formulas and backward updates, for SPE learning performance and show that our method outperforms some alternatives. Finally, we extend the range of criteria covered by SPERL to cumulative prospect theory (CPT). We develop novel CPT-SPERL methods that depart from both the extended DP theory and policy iteration by building on recent progress in distributional RL. To support our methods, we develop new theory on CPT predictions and SPE characterizations, which in turn contributes to the relatively young theory of risk-sensitive distributional RL. Our experimental results demonstrate the efficacy of our methods in SPE learning.
Moreover, by studying different classes of methods and notions of optimality in the CPT context, we obtain new evidence for the desirability of SPERL and SPE as a controller. |
author2 |
Patrick Pun Chi Seng |
author_facet |
Patrick Pun Chi Seng Lesmana, Nixie Sapphira |
format |
Thesis-Doctor of Philosophy |
author |
Lesmana, Nixie Sapphira |
author_sort |
Lesmana, Nixie Sapphira |
title |
Time-inconsistency in reinforcement learning: an equilibrium control paradigm |
title_short |
Time-inconsistency in reinforcement learning: an equilibrium control paradigm |
title_full |
Time-inconsistency in reinforcement learning: an equilibrium control paradigm |
title_fullStr |
Time-inconsistency in reinforcement learning: an equilibrium control paradigm |
title_full_unstemmed |
Time-inconsistency in reinforcement learning: an equilibrium control paradigm |
title_sort |
time-inconsistency in reinforcement learning: an equilibrium control paradigm |
publisher |
Nanyang Technological University |
publishDate |
2024 |
url |
https://hdl.handle.net/10356/173187 |
_version_ |
1789968691374325760 |
spelling |
sg-ntu-dr.10356-173187 2024-02-01T09:53:44Z Time-inconsistency in reinforcement learning: an equilibrium control paradigm Lesmana, Nixie Sapphira Patrick Pun Chi Seng School of Physical and Mathematical Sciences cspun@ntu.edu.sg Science::Mathematics::Applied mathematics Engineering::Computer science and engineering::Computing methodologies Doctor of Philosophy 2024-01-17T02:12:19Z 2024-01-17T02:12:19Z 2023 Thesis-Doctor of Philosophy Lesmana, N. S. (2023). Time-inconsistency in reinforcement learning: an equilibrium control paradigm. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/173187 https://hdl.handle.net/10356/173187 10.32657/10356/173187 en Nanyang President's Graduate Scholarship (NPGS) This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University |
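
The abstract's definition of time inconsistency under non-exponential discounting can be illustrated with a small preference-reversal sketch. This is not taken from the thesis: the hyperbolic discount form 1/(1 + k·delay), the parameter values `k=1.0` and `gamma=0.9`, the two reward options, and all function names are assumptions chosen only for illustration.

```python
# Illustrative sketch only; rewards, times, and discount parameters are assumed.

def hyperbolic(delay, k=1.0):
    """Non-exponential (hyperbolic) discount factor for a reward `delay` steps away."""
    return 1.0 / (1.0 + k * delay)


def exponential(delay, gamma=0.9):
    """Standard exponential discount factor for a reward `delay` steps away."""
    return gamma ** delay


def preferred(now, options, discount):
    """Return the label of the option with the highest discounted value seen from time `now`."""
    return max(
        options,
        key=lambda label: options[label][0] * discount(options[label][1] - now),
    )


# Each option is (reward, time at which the reward is received).
options = {"small_soon": (10.0, 10), "large_late": (15.0, 13)}

for name, discount in [("hyperbolic", hyperbolic), ("exponential", exponential)]:
    plan_today = preferred(0, options, discount)     # plan made at time 0
    choice_later = preferred(10, options, discount)  # re-evaluation at time 10
    verdict = "time-consistent" if plan_today == choice_later else "time-inconsistent"
    print(f"{name}: plan at t=0 is {plan_today}, choice at t=10 is {choice_later} ({verdict})")
```

Under these assumed numbers the hyperbolic agent plans for the larger, later reward at time 0 but switches to the smaller, sooner reward once time 10 arrives, while the exponential agent keeps its original plan. A sophisticated agent in the sense described in the abstract would anticipate that switch when forming its plan at time 0.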