Multi-agent reinforcement learning for complex sequential decision-making
Many tasks involve multiple agents and require sequential decision-making policies to achieve common goals, such as football games, real-time strategy games, and traffic light control in road networks. To obtain the policies of all agents, these problems can be modeled as multi-agent systems and solved with multi-agent reinforcement learning (MARL)...
Saved in:
Main Author: | Qiu, Wei |
---|---|
Other Authors: | Bo An |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence |
Online Access: | https://hdl.handle.net/10356/173035 |
Institution: | Nanyang Technological University |
Language: | English |
id |
sg-ntu-dr.10356-173035 |
---|---|
record_format |
dspace |
institution |
Nanyang Technological University |
building |
NTU Library |
continent |
Asia |
country |
Singapore |
content_provider |
NTU Library |
collection |
DR-NTU |
language |
English |
topic |
Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence |
description |
Many tasks involve multiple agents and require sequential decision-making policies to achieve common goals, such as football games, real-time strategy games, and traffic light control in road networks. To obtain the policies of all agents, these problems can be modeled as multi-agent systems and solved with multi-agent reinforcement learning (MARL). However, optimizing policies in multi-agent scenarios is non-trivial due to complex multi-agent behaviors and the non-stationarity of the environment’s dynamics. Agents’ behaviors and interactions with other agents cause the environment’s states and the agents’ observations to change over time, making it challenging to develop policies that remain effective. In addition, partial observability, where agents have limited or incomplete information about the environment, complicates the problem, and the inherent uncertainty in the environment’s dynamics makes decision-making unstable.
This doctoral thesis addresses these challenges by proposing novel MARL methods that empower agents to learn efficient policies in dynamic and partially observable environments, especially environments where cooperation is required. In particular, we tackle the following four fundamental multi-agent research problems and propose a solution for each.
We start by studying the problem of learning risk-sensitive cooperative policies in risky scenarios, characterized by significant potential reward loss due to the execution of low-return actions. In particular, we focus on environments where agent heterogeneity is prevalent within the team and opponents may outnumber the RL agents. To tackle the problem, we propose RMIX, which learns risk-sensitive cooperative policies for MARL. We first model the distributions of individual agents’ Q values via distributional RL, and then apply the Conditional Value at Risk (CVaR) measure to each individual return distribution. We also propose a dynamic risk level optimizer to handle the temporal nature of the stochastic outcomes during execution. Empirically, RMIX shows leading performance over state-of-the-art methods in various multi-agent risk-sensitive scenarios, demonstrating enhanced coordination and improved sample efficiency.
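For intuition, the following is a minimal sketch of the CVaR risk measure described above, applied to a quantile-based return distribution for greedy action selection; it is illustrative only, not the thesis's RMIX implementation, and the array shapes and function names are assumptions.

```python
import numpy as np

def cvar(quantiles: np.ndarray, alpha: float) -> np.ndarray:
    """CVaR_alpha of a return distribution represented by equally weighted
    quantile samples: the mean of the worst alpha-fraction of outcomes.

    quantiles: shape (n_actions, n_quantiles), quantile estimates of the
               per-action return distribution.
    alpha:     risk level in (0, 1]; alpha = 1.0 recovers the ordinary mean.
    """
    sorted_q = np.sort(quantiles, axis=-1)                 # ascending returns
    k = max(1, int(np.ceil(alpha * sorted_q.shape[-1])))   # worst alpha-fraction
    return sorted_q[..., :k].mean(axis=-1)                 # shape (n_actions,)

def risk_sensitive_action(quantiles: np.ndarray, alpha: float) -> int:
    """Greedy action w.r.t. CVaR instead of the expected return."""
    return int(np.argmax(cvar(quantiles, alpha)))

# Example: two actions, five quantile samples each.
q = np.array([[1.0, 2.0, 3.0, 4.0, 10.0],   # higher mean, heavy downside spread
              [2.5, 2.6, 2.7, 2.8,  2.9]])  # lower mean, low risk
print(risk_sensitive_action(q, alpha=0.25))  # picks the safer action 1
```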
We then investigate the problem of learning scalable policies for dynamic Electronic Toll Collection (DETC), where the traffic network is large and its conditions change over time. To this end, we propose a novel MARL approach that scales up DETC by decomposing large states into smaller parts and learning a multi-agent policy for each decomposed state with a cooperative MARL method. Specifically, we decompose the road network graph into smaller subgraphs and propose a novel edge-based graph convolutional neural network (eGCN) to extract the spatio-temporal correlations of the road network features; the extracted features are fed into the policy network of the cooperative MARL method. Experimental results show that this divide-and-conquer approach scales up to realistic-sized problems with robust performance and significantly outperforms the state-of-the-art method.
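As a rough illustration of an edge-based graph convolution over road-segment features (not the thesis's eGCN), the sketch below aggregates features of edges that share an intersection; the adjacency construction, layer form, and names are assumptions.

```python
import numpy as np

def edge_adjacency(edges: list[tuple[int, int]]) -> np.ndarray:
    """Two edges are adjacent if they share an endpoint (line-graph adjacency)."""
    m = len(edges)
    adj = np.eye(m)                                    # include self-loops
    for i, (u1, v1) in enumerate(edges):
        for j, (u2, v2) in enumerate(edges):
            if i != j and {u1, v1} & {u2, v2}:
                adj[i, j] = 1.0
    return adj / adj.sum(axis=1, keepdims=True)        # row-normalize

def egcn_layer(edge_feats: np.ndarray, adj: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One convolution over edge features: aggregate neighbors, project, ReLU."""
    return np.maximum(adj @ edge_feats @ weight, 0.0)

# Toy road network: 4 intersections, 3 road segments, 2 features per segment.
edges = [(0, 1), (1, 2), (2, 3)]
x = np.random.rand(3, 2)                               # e.g. flow and toll level
w = np.random.rand(2, 4)                               # learned projection
h = egcn_layer(x, edge_adjacency(edges), w)            # (3, 4) edge embeddings
```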
Thirdly, we focus on learning efficient multi-agent coordination policies in scenarios where actions have durations. With such durations, rewards are temporally displaced, which makes training MARL policies with temporal-difference learning challenging. To address this problem, we propose a reward redistribution method built on a novel graph-based episodic memory, LeGEM-core, to learn efficient multi-agent coordination in environments where off-beat actions are prevalent. Off-beat actions are actions with durations, during which the environment keeps changing under the influence of the executed actions. LeGEM-core explicitly memorizes agents’ past experiences and enables credit assignment in MARL training. We name the resulting method LeGEM. We evaluate LeGEM on various multi-agent scenarios with off-beat actions, including the Stag-Hunter Game, Quarry Game, and Afforestation Game. Empirical results show that it significantly boosts coordination in environments with off-beat actions and achieves leading performance.
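The sketch below illustrates the general idea of moving a delayed reward back to the step at which the responsible off-beat action was initiated, so temporal-difference targets credit the right decision; it assumes the responsible step is already known and is not the graph-based LeGEM mechanism itself. All names are hypothetical.

```python
def redistribute_rewards(rewards, action_start_steps):
    """Toy reward redistribution for off-beat actions.

    rewards:            per-step rewards of one episode; a delayed payoff
                        arrives at the step an off-beat action finishes.
    action_start_steps: map from the step a reward arrives to the earlier
                        step at which the responsible action was initiated.
    Returns a new reward list with delayed rewards moved back to the steps
    that caused them.
    """
    new_rewards = list(rewards)
    for arrival_step, start_step in action_start_steps.items():
        new_rewards[start_step] += new_rewards[arrival_step]
        new_rewards[arrival_step] = 0.0
    return new_rewards

# A reward of 1.0 arrives at t=5 but was caused by an action taken at t=2.
rewards = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
print(redistribute_rewards(rewards, {5: 2}))  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```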
Lastly, we aim to learn generalizable policies that enable agents to coordinate or compete with other agents whose policies were unseen during training. We propose RPM, which learns generalizable policies for agents in evaluation scenarios where other agents behave differently. The main idea of RPM is to train MARL policies on massive and diverse multi-agent interaction data: we rank each agent’s policies by their training episode returns and save the ranked policies in a memory; when an episode starts, each agent can randomly select a policy from the memory as its behavior policy. This self-play framework diversifies multi-agent interactions in the training data and improves the generalization performance of MARL. Experimental results on Melting Pot demonstrate that RPM enables agents to interact with unseen agents in multi-agent generalization evaluation scenarios and achieves improved performance.
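A minimal sketch of the ranked policy memory idea described above, with hypothetical class and method names: policy checkpoints are keyed by a discretized training episode return, and behavior policies are drawn from the memory at the start of each episode.

```python
import random
from collections import defaultdict

class RankedPolicyMemory:
    """Toy ranked policy memory: store policy checkpoints keyed by the
    training episode return that produced them, then sample diverse
    behavior policies at the start of each episode."""

    def __init__(self, bin_size: float = 1.0):
        self.bin_size = bin_size
        self.memory = defaultdict(list)      # return bin -> list of checkpoints

    def save(self, policy_checkpoint, episode_return: float) -> None:
        key = round(episode_return / self.bin_size)
        self.memory[key].append(policy_checkpoint)

    def sample(self):
        """Pick a return bin uniformly at random, then a checkpoint from it."""
        key = random.choice(list(self.memory.keys()))
        return random.choice(self.memory[key])

# Usage: after each training episode, store the current policy with its return;
# when a new episode starts, each agent draws its behavior policy from memory.
rpm = RankedPolicyMemory(bin_size=5.0)
rpm.save("policy_iter_100", episode_return=12.3)
rpm.save("policy_iter_200", episode_return=27.8)
behavior_policy = rpm.sample()
```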
To conclude, this doctoral thesis investigates four fundamental multi-agent sequential decision-making research problems that are ubiquitous and unsolved. The four proposed MARL methods achieve efficient policy training and strong performance for agents in multi-agent environments, addressing the uncertainties arising from potential reward loss, large state spaces, action durations, and the limited generalizability of MARL. |
author2 |
Bo An |
format |
Thesis-Doctor of Philosophy |
author |
Qiu, Wei |
author_sort |
Qiu, Wei |
title |
Multi-agent reinforcement learning for complex sequential decision-making |
publisher |
Nanyang Technological University |
publishDate |
2024 |
url |
https://hdl.handle.net/10356/173035 |
spelling |
sg-ntu-dr.10356-173035 2024-02-05T03:44:31Z Multi-agent reinforcement learning for complex sequential decision-making Qiu, Wei Bo An School of Computer Science and Engineering Lana Obraztsova boan@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence Doctor of Philosophy 2024-01-10T00:50:20Z 2023 Thesis-Doctor of Philosophy Qiu, W. (2023). Multi-agent reinforcement learning for complex sequential decision-making. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/173035 10.32657/10356/173035 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University |