Robust and adaptive decision-making: a reinforcement learning perspective

Bibliographic Details
Main Author: Xue, Wanqi
Other Authors: Bo An
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2024
Online Access: https://hdl.handle.net/10356/173125
Institution: Nanyang Technological University
Description
Summary: How to make decisions in complex and uncertain environments is a challenging and crucial task. Adversaries and perturbations in these environments disrupt existing policies, while the dynamic nature of the environments renders policies obsolete. Therefore, it is vital to learn robust policies capable of maintaining optimal performance in extreme scenarios and quickly adapting to changes. In this thesis, we focus on three real-world problem domains: network security games (NSGs), inter-agent communication, and sequential recommendation. All of these domains necessitate robust and adaptive decision-making.
Our first focus is learning robust defending policies in NSGs. In this domain, two algorithms are designed to improve scalability and data efficiency, respectively. First, we propose NSG-NFSP, a novel approach that aims to find Nash equilibria in NSGs on a large scale. NSG-NFSP employs deep neural networks to learn mappings from state-action pairs to values, representing either Q-values or probabilities. NSG-NFSP surpasses state-of-the-art algorithms in terms of scalability and solution quality. Second, we introduce NSGZero, a data-efficient learning method for acquiring non-exploitable policies in NSGs. NSGZero incorporates three neural networks, i.e., the dynamics network, the value network, and the prior network, to facilitate efficient Monte Carlo tree search (MCTS) in NSGs. Furthermore, we integrate decentralized control into neural MCTS, enabling NSGZero to handle NSGs with a large number of security resources. Extensive experiments on diverse NSGs with various graph structures and scales demonstrate the superior performance of NSGZero, even with limited training experience.
The next focus of this thesis is robust communication in multi-agent communicative reinforcement learning (MACRL), a topic that has been largely neglected. We provide a formal definition of adversarial communication and propose an effective method for modeling message attacks in MACRL. We design a two-stage message filter to defend against message attacks. To enhance robustness, we formulate the adversarial communication problem as a two-player zero-sum game and design an algorithm, R-MACRL, to solve it. Extensive experiments across different algorithms and tasks reveal the vulnerability of state-of-the-art MACRL methods to message attacks, while our proposed algorithm consistently restores multi-agent cooperation and improves the robustness of MACRL algorithms under message attacks.
Furthermore, we investigate the problem of adapting recommendation policies to newly collected data for optimizing long-term user engagement in sequential recommendation. Two reinforcement learning algorithms are developed to learn policies with and without explicitly designed rewards. First, we introduce ResAct, an algorithm that improves the performance of recommender systems with pre-defined rewards. ResAct reconstructs the behavior of the online-serving policy and enhances it by adding a residual to the actions, resulting in a policy that closely aligns with the original policy but performs better. To improve the expressiveness and conciseness of state representations, we design two information-theoretic regularizers. Empirical evaluation demonstrates that ResAct outperforms previous state-of-the-art algorithms across all tasks. Additionally, we propose PrefRec, which learns recommendation policies from preferences between users’ historical behaviors rather than pre-defined rewards. This approach leverages the strengths of RL, such as optimizing long-term goals, while avoiding the complexities of reward engineering. PrefRec automatically learns a reward function from preferences and uses it to generate reinforcement signals for training the recommendation policy. We design an effective optimization method for PrefRec, utilizing an additional value function, expectile regression, and reward function pre-training to enhance performance. Experimental results highlight the significant performance improvements of PrefRec over the current state-of-the-art across various long-term user engagement optimization tasks.
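Illustrative note: the summary names expectile regression as one component of PrefRec's optimization method but does not say how it is applied. The sketch below shows only the generic technique, an asymmetrically weighted squared loss and the value that minimizes it over a sample. It is not the thesis implementation, and the names `expectile_loss` and `fit_expectile` are hypothetical, introduced here for illustration.

```python
from typing import Sequence


def expectile_loss(residual: float, tau: float) -> float:
    """Asymmetric squared loss: non-negative residuals are weighted by tau,
    negative residuals by (1 - tau). Here residual = target - prediction."""
    weight = tau if residual >= 0 else 1.0 - tau
    return weight * residual * residual


def fit_expectile(samples: Sequence[float], tau: float, iters: int = 200) -> float:
    """Estimate the tau-expectile of `samples` by iteratively reweighted averaging.

    Assumes 0 < tau < 1 and a non-empty sample. With tau = 0.5 the
    result is the ordinary mean; larger tau pulls the estimate upward.
    """
    estimate = sum(samples) / len(samples)
    for _ in range(iters):
        # Samples above the current estimate get weight tau, the rest 1 - tau.
        weights = [tau if x > estimate else 1.0 - tau for x in samples]
        estimate = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return estimate


if __name__ == "__main__":
    data = [0.0, 1.0, 2.0, 3.0, 4.0]
    print(fit_expectile(data, 0.5))  # the mean of the data
    print(fit_expectile(data, 0.9))  # an estimate above the mean
```

In value-based reinforcement learning, a loss of this shape is sometimes used with tau above 0.5 to bias a value estimate toward the better outcomes seen in logged data. Whether PrefRec uses it in that way is an assumption that the summary does not confirm.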