Multi-agent reinforcement learning for complex sequential decision-making
| Main Author: | |
|---|---|
| Other Authors: | |
| Format: | Thesis - Doctor of Philosophy |
| Language: | English |
| Published: | Nanyang Technological University, 2024 |
| Subjects: | |
| Online Access: | https://hdl.handle.net/10356/173035 |
| Institution: | Nanyang Technological University |
Summary: Many tasks involve multiple agents and require sequential decision-making policies to achieve common goals, such as football games, real-time strategy games, and traffic-light control in road networks. To obtain the policies of all agents, these problems can be modeled as multi-agent systems and solved with multi-agent reinforcement learning (MARL). However, optimizing policies in multi-agent scenarios is non-trivial because of complex multi-agent behaviors and the non-stationary, complex dynamics of the environment. Each agent's behavior and its interactions with other agents cause the environment's states and the agents' observations to change over time, making it difficult to learn policies that remain effective. In addition, partial observability, where agents have limited or incomplete information about the environment, further complicates the problem, and the inherent uncertainty in the environment's dynamics makes decision-making unstable.
This doctoral thesis addresses these challenges by proposing novel MARL methods. These methods enable agents to learn efficient policies in dynamic and partially observable environments, especially those that require cooperation. In particular, we tackle the following four fundamental multi-agent research problems and propose a solution for each.
We start by studying the problem of learning risk-sensitive cooperative policies in risky scenarios, characterized by a significant potential loss of reward when low-return actions are executed. In particular, we focus on environments where agent heterogeneity is prevalent within the team and opponents may outnumber the RL agents. To tackle this problem, we propose RMIX, which learns risk-sensitive cooperative policies for MARL. We first model the distribution of each individual agent's Q values via distributional RL and then apply the Conditional Value at Risk (CVaR) measure to the individual return distributions. We also propose a dynamic risk level optimizer to handle the temporal nature of the stochastic outcomes during execution. Empirically, RMIX outperforms state-of-the-art methods in various multi-agent risk-sensitive scenarios, demonstrating enhanced coordination and improved sample efficiency.
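To make the risk measure concrete, the sketch below estimates CVaR at level α from a set of return quantiles, in the spirit of distributional RL. The function name, the equally weighted quantile representation, and the example values are illustrative assumptions, not RMIX's actual implementation.

```python
import numpy as np

def cvar_from_quantiles(quantiles: np.ndarray, alpha: float) -> float:
    """Estimate CVaR_alpha from equally weighted return quantiles.

    CVaR_alpha is the expected return over the worst alpha-fraction of
    outcomes, i.e. the mean of the lowest quantiles.
    """
    q = np.sort(quantiles)                      # returns in ascending order
    k = max(1, int(np.ceil(alpha * len(q))))    # number of tail samples
    return float(q[:k].mean())

# Example: an agent's return distribution represented by 8 quantiles.
quantiles = np.array([-4.0, -1.0, 0.0, 0.5, 1.0, 2.0, 3.0, 5.0])
print(cvar_from_quantiles(quantiles, alpha=0.25))  # mean of the two worst returns: -2.5
```

A risk-sensitive agent would then prefer the action whose return distribution maximizes this CVaR value rather than the mean return; in RMIX the risk level α is further adjusted by the dynamic risk level optimizer during execution.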
We then investigate the problem of learning scalable policies for Dynamic Electronic Toll Collection (DETC), where the traffic network is large and dynamic. To this end, we propose a novel MARL approach that scales up DETC by decomposing the large state into smaller parts and learning a multi-agent policy for each decomposed state with a cooperative MARL method. Specifically, we decompose the road graph into smaller subgraphs and propose a novel edge-based graph convolutional neural network (eGCN) to extract the spatio-temporal correlations of the road-network features; the extracted features are fed into the policy network of the cooperative MARL method. Experimental results show that this divide-and-conquer approach scales up to realistic-sized problems with robust performance and significantly outperforms the state-of-the-art method.
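As a rough illustration of edge-based graph convolution, the sketch below lets each road segment (edge) aggregate features from segments adjacent to it; the class name, shapes, and mean-aggregation rule are assumptions for illustration and do not reproduce the eGCN architecture from the thesis.

```python
import torch
import torch.nn as nn

class EdgeGraphConvLayer(nn.Module):
    """Illustrative edge-based graph convolution: each road segment aggregates
    features from segments that share an intersection with it."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, edge_feats: torch.Tensor, edge_adj: torch.Tensor) -> torch.Tensor:
        # edge_feats: (num_edges, in_dim) traffic features per road segment
        # edge_adj:   (num_edges, num_edges) adjacency between road segments
        deg = edge_adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        agg = edge_adj @ edge_feats / deg          # mean over neighboring segments
        return torch.relu(self.linear(agg + edge_feats))

# Example: 5 road segments, each with 4-dimensional traffic features.
feats = torch.randn(5, 4)
adj = (torch.rand(5, 5) > 0.5).float()
layer = EdgeGraphConvLayer(4, 8)
print(layer(feats, adj).shape)  # torch.Size([5, 8])
```

Stacking such layers over the decomposed subgraphs would yield the per-subgraph spatio-temporal features that the cooperative MARL policy network consumes.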
Thirdly, we focus on learning efficient multi-agent coordination policies in scenarios where actions have durations. Such durations displace rewards in time, which makes training MARL policies with temporal-difference learning challenging. To address this problem, we propose a novel reward redistribution method built on LeGEM-core, our novel graph-based episodic memory, to learn efficient multi-agent coordination in environments where off-beat actions are prevalent. Off-beat actions are actions with durations during which the environment changes as a result of their execution. LeGEM-core explicitly memorizes agents' past experiences and enables credit assignment in MARL training; we name the resulting method LeGEM. We evaluate LeGEM on various multi-agent scenarios with off-beat actions, including the Stag-Hunter Game, Quarry Game, and Afforestation Game. Empirical results show that it significantly boosts multi-agent coordination in environments with off-beat actions and achieves leading performance.
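To make the reward-displacement issue concrete, here is a toy sketch that shifts a delayed reward back to the timestep at which the responsible off-beat action was launched, using a simple per-episode record of action start times. The data structure and the direct-shift rule are hypothetical simplifications, not LeGEM-core itself.

```python
def redistribute_rewards(rewards, credit_step):
    """Shift each delayed reward back to the step where the responsible
    off-beat action was launched.

    rewards:     per-step environment rewards for one episode
    credit_step: maps a step carrying a delayed reward to the earlier step
                 at which the off-beat action that earned it started
    """
    shaped = [0.0] * len(rewards)
    for t, r in enumerate(rewards):
        shaped[credit_step.get(t, t)] += r
    return shaped

# Example: an action launched at t=1 with duration 3 only pays off at t=4;
# after redistribution, temporal-difference learning sees the reward at the
# decision step instead of the delayed step.
rewards = [0.0, 0.0, 0.0, 0.0, 10.0]
print(redistribute_rewards(rewards, {4: 1}))  # [0.0, 10.0, 0.0, 0.0, 0.0]
```

In the thesis, the mapping from delayed rewards back to decision points is recovered from the graph-based episodic memory rather than given by hand as it is here.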
Lastly, we aim to learn generalizable policies that enable agents to coordinate or compete with other agents whose policies were unseen during training. We propose RPM, which learns generalizable policies for agents in evaluation scenarios where the other agents behave differently. The main idea of RPM is to train MARL policies on massive and diverse multi-agent interaction data. We first rank each agent's policies by their training episode returns and save the ranked policies in a memory; when an episode starts, each agent randomly selects a policy from the memory as its behavior policy. This novel self-play framework diversifies the multi-agent interactions in the training data and improves the generalization performance of MARL. Experimental results on Melting Pot demonstrate that RPM enables agents to interact with unseen agents in multi-agent generalization evaluation scenarios and achieves improved performance.
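A minimal sketch of the ranked-policy-memory idea described above: policy snapshots are stored in buckets keyed by a rank derived from training episode return, and a behavior policy is sampled from the memory at the start of each episode. The bucketing scheme and class interface are assumptions for illustration, not RPM's exact design.

```python
import random
from collections import defaultdict

class RankedPolicyMemory:
    """Store policy snapshots bucketed by episode-return rank and sample one
    uniformly as a behavior policy when a new episode starts."""

    def __init__(self, bucket_size: float = 10.0):
        self.bucket_size = bucket_size
        self.buckets = defaultdict(list)   # rank -> list of policy snapshots

    def save(self, policy_params, episode_return: float) -> None:
        rank = int(episode_return // self.bucket_size)
        self.buckets[rank].append(policy_params)

    def sample(self):
        rank = random.choice(list(self.buckets.keys()))
        return random.choice(self.buckets[rank])

# Example: save snapshots during training, then draw diverse behavior policies
# so that agents face partners and opponents of varying skill levels.
memory = RankedPolicyMemory()
for step, ret in enumerate([3.0, 12.0, 25.0, 27.0]):
    memory.save({"checkpoint": step}, ret)
behavior_policy = memory.sample()   # e.g. {'checkpoint': 2}
```

Sampling across ranks rather than always using the latest policy is what diversifies the interaction data and, per the thesis, improves generalization to unseen co-players.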
To conclude, this doctoral thesis investigates four fundamental multi-agent sequential decision-making problems that are ubiquitous yet unsolved. The four proposed MARL methods achieve efficient policy training and strong performance for agents in multi-agent environments, addressing the uncertainty arising from potential reward loss, large state spaces, action durations, and the limited generalizability of MARL.