Robust and adaptive decision-making: a reinforcement learning perspective
Main Author: Xue, Wanqi
Other Authors: Bo An
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2024
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Online Access: https://hdl.handle.net/10356/173125
Institution: Nanyang Technological University
Description:
Making decisions in complex and uncertain environments is a challenging and crucial task. Adversaries and perturbations in these environments disrupt existing policies, while the dynamic nature of the environments renders policies obsolete. It is therefore vital to learn robust policies that maintain optimal performance in extreme scenarios and adapt quickly to change. In this thesis, we focus on three real-world problem domains: network security games (NSGs), inter-agent communication, and sequential recommendation. All of these domains necessitate robust and adaptive decision-making.
Our first emphasis is on learning robust defending policies in NSGs. In this domain, we design two algorithms to improve scalability and data efficiency, respectively. First, we propose NSG-NFSP, a novel approach for finding Nash equilibria in large-scale NSGs. NSG-NFSP employs deep neural networks to learn mappings from state-action pairs to values, representing either Q-values or probabilities. NSG-NFSP surpasses state-of-the-art algorithms in both scalability and solution quality. Second, we introduce NSGZero, a data-efficient method for learning non-exploitable policies in NSGs. NSGZero incorporates three neural networks, i.e., a dynamics network, a value network, and a prior network, to enable efficient Monte Carlo tree search (MCTS) in NSGs. Furthermore, we integrate decentralized control into neural MCTS, enabling NSGZero to handle NSGs with a large number of security resources. Extensive experiments on diverse NSGs with various graph structures and scales demonstrate the superior performance of NSGZero, even with limited training experience.
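To make the three-network design concrete, below is a minimal sketch of neural MCTS in the spirit of NSGZero, assuming MuZero-style interfaces. The callables `dynamics`, `prior`, and `value` stand in for the thesis' three networks, and all names and hyperparameters are illustrative assumptions, not the actual implementation; the decentralized control across security resources is omitted.

```python
# A minimal sketch of neural MCTS in the spirit of NSGZero, assuming
# MuZero-style interfaces. The three callables are illustrative stand-ins
# for the thesis' three networks, not the actual implementation:
#   dynamics(state, action) -> next (latent) state
#   prior(state)            -> probability over actions
#   value(state)            -> scalar estimate of the return
import math
import numpy as np

class Node:
    def __init__(self, state, prior_p):
        self.state = state
        self.prior_p = prior_p      # P(s, a) from the prior network
        self.children = {}          # action -> Node
        self.visit_count = 0
        self.value_sum = 0.0

    def mean_value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def puct_score(parent, child, c_puct=1.5):
    # PUCT: exploit the running value estimate, explore via the prior.
    u = c_puct * child.prior_p * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return child.mean_value() + u

def simulate(root, dynamics, prior, value, n_actions, max_depth=10):
    """One simulation: select by PUCT until a leaf, expand the leaf with
    the prior and dynamics networks, evaluate it with the value network
    (no rollout), and back the value up along the visited path."""
    path, node = [root], root
    for _ in range(max_depth):
        if not node.children:                       # leaf: expand
            probs = prior(node.state)
            for a in range(n_actions):
                node.children[a] = Node(dynamics(node.state, a), probs[a])
            break
        a = max(node.children,
                key=lambda a: puct_score(node, node.children[a]))
        node = node.children[a]
        path.append(node)
    leaf_value = value(node.state)                  # bootstrap, no rollout
    for n in path:                                  # back up
        n.visit_count += 1
        n.value_sum += leaf_value

# Toy usage with random stand-in networks on a 4-action problem.
rng = np.random.default_rng(0)
root = Node(state=0.0, prior_p=1.0)
for _ in range(100):
    simulate(root,
             dynamics=lambda s, a: s + a,           # toy dynamics
             prior=lambda s: np.full(4, 0.25),      # uniform prior
             value=lambda s: float(rng.standard_normal()),
             n_actions=4)
best_action = max(root.children, key=lambda a: root.children[a].visit_count)
```

Because the search expands and evaluates nodes through learned networks rather than a simulator, few environment interactions are needed, which is where the claimed data efficiency comes from.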
The next focus of this thesis is robust communication in multi-agent communicative reinforcement learning (MACRL), a topic that has been largely neglected. We provide a formal definition of adversarial communication and propose an effective method for modeling message attacks in MACRL. To defend against such attacks, we design a two-stage message filter. To further enhance robustness, we formulate the adversarial communication problem as a two-player zero-sum game and design an algorithm, R-MACRL, to solve it. Extensive experiments across different algorithms and tasks reveal the vulnerability of state-of-the-art MACRL methods to message attacks, while our proposed algorithm consistently restores multi-agent cooperation and improves the robustness of MACRL algorithms under message attacks.
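The abstract does not detail the two-stage filter, so the following is only a minimal illustration of the detect-then-recover pattern it names: stage one flags anomalous messages, stage two replaces flagged messages with a reconstruction. Both `anomaly_score` and `reconstruct` are hypothetical stand-ins for the learned models the thesis would use.

```python
# Illustrative two-stage message filter for MACRL. Both stages here are
# hypothetical stand-ins: in the thesis' setting, the detector and the
# recovery model would be learned, not these simple statistics.
import numpy as np

def anomaly_score(message, history):
    # Stage-1 stand-in: distance of the message from the running mean
    # of previously observed benign messages.
    return float(np.linalg.norm(message - history.mean(axis=0)))

def reconstruct(history):
    # Stage-2 stand-in: replace a flagged message with a prediction
    # from past traffic.
    return history.mean(axis=0)

def filter_messages(messages, history, threshold=1.0):
    """Stage 1 flags suspicious messages; stage 2 replaces them with a
    reconstruction so cooperation is not silently poisoned."""
    filtered = []
    for m in messages:
        if anomaly_score(m, history) > threshold:   # stage 1: detect
            m = reconstruct(history)                # stage 2: recover
        filtered.append(m)
    return np.stack(filtered)

# Usage: benign traffic near zero; the attacked message stands out.
rng = np.random.default_rng(0)
history = rng.normal(0.0, 0.1, size=(64, 8))
msgs = [rng.normal(0.0, 0.1, size=8), np.full(8, 5.0)]  # second one attacked
print(filter_messages(msgs, history))
```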
Furthermore, we investigate adapting recommendation policies to newly collected data so as to optimize long-term user engagement in sequential recommendation. We develop two reinforcement learning algorithms that learn policies with and without explicitly designed rewards. First, we introduce ResAct, an algorithm that improves the performance of recommender systems with pre-defined rewards. ResAct reconstructs the behavior of the online-serving policy and enhances it by adding a residual to its actions, yielding a policy that stays close to the original but performs better. To improve the expressiveness and conciseness of state representations, we design two information-theoretic regularizers. Empirical evaluation demonstrates that ResAct outperforms previous state-of-the-art algorithms across all tasks. Additionally, we propose PrefRec, which learns recommendation policies from preferences between users' historical behaviors rather than from pre-defined rewards. This approach leverages the strengths of RL, such as optimizing long-term goals, while avoiding the complexities of reward engineering. PrefRec automatically learns a reward function from preferences and uses it to generate reinforcement signals for training the recommendation policy. We design an effective optimization method for PrefRec, utilizing an additional value function, expectile regression, and reward-function pre-training to enhance performance. Experimental results highlight the significant performance improvements of PrefRec over the current state-of-the-art across various long-term user engagement optimization tasks.
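As an illustration of learning a reward function from preferences, the sketch below trains a reward model with a Bradley-Terry-style objective, a standard choice in preference-based RL. It shows only the core idea behind PrefRec; the thesis' actual architecture, its expectile-regression value learning, and its reward pre-training are not reproduced here, and all names are hypothetical.

```python
# Sketch of preference-based reward learning (the core idea behind
# PrefRec), using a standard Bradley-Terry objective. Architecture and
# names are illustrative assumptions, not the thesis' implementation.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def trajectory_return(self, traj):   # traj: (T, state_dim)
        # Sum of learned per-step rewards over one user history.
        return self.net(traj).sum()

def preference_loss(model, traj_a, traj_b, a_preferred):
    """Bradley-Terry: P(a > b) = sigmoid(R(a) - R(b));
    maximize the likelihood of the observed preference label."""
    logits = model.trajectory_return(traj_a) - model.trajectory_return(traj_b)
    target = torch.tensor(1.0 if a_preferred else 0.0)
    return nn.functional.binary_cross_entropy_with_logits(logits, target)

# Usage: one gradient step on a single labeled pair of user histories.
model = RewardModel(state_dim=16)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
traj_a, traj_b = torch.randn(20, 16), torch.randn(20, 16)
loss = preference_loss(model, traj_a, traj_b, a_preferred=True)
opt.zero_grad()
loss.backward()
opt.step()
```

The learned reward model can then supply per-step reinforcement signals for training the recommendation policy, replacing hand-engineered rewards.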
Citation: Xue, W. (2023). Robust and adaptive decision-making: a reinforcement learning perspective. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/173125
DOI: 10.32657/10356/173125
School: School of Computer Science and Engineering
Degree: Doctor of Philosophy
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)