Robust multi-agent team behaviors in uncertain environment via reinforcement learning
Format: Thesis-Master by Research
Language: English
Published: Nanyang Technological University, 2022
Online Access: https://hdl.handle.net/10356/159448
Institution: Nanyang Technological University
Summary: Many state-of-the-art cooperative multi-agent reinforcement learning (MARL) approaches, such as MADDPG, COMA, and QMIX, have focused mainly on performing well in idealized scenarios, where agents face environmental conditions and opponents similar to those encountered during training. The resulting policies are often brittle because they overfit to the training environment, and they cannot be easily deployed outside the laboratory.
While adversarial learning is one way to train robust policies, most such work has focused on single-agent RL with adversarial perturbations to a static environment. A few robust MARL methods build on adversarial training, but they target specialized settings: M3DDPG assumes the extreme case in which all other agents are adversarial, and Phan et al. consider agents that malfunction and turn adversarial. Many of these approaches sacrifice team coordination to achieve robustness, and little emphasis has been placed on maintaining good team coordination while ensuring robustness. This gap suggests that robustness should be a design objective of MARL algorithms alongside performance, rather than an afterthought.
This work focuses on learning robust team policies that perform well even when the environment and opponent behaviour differ significantly from those seen during training. We propose the Signal-mediated Team Maxmin (STeaM) framework, an end-to-end MARL framework that approximates the game-theoretic solution concept of team-maxmin equilibrium with a correlation device (TMECor), to address both agent coordination and policy robustness. STeaM uses a pre-agreed signal to coordinate team actions, and it approximates TMECor policies through consistency and diversity regularizations combined with a best-response, gradient-descent self-play equilibrium learning procedure.
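To make the signal-mediated coordination idea concrete, the minimal PyTorch sketch below shows one way a team policy could be conditioned on a shared correlation signal and nudged towards signal-dependent diversity. The class and function names, network sizes, and the specific KL-based diversity regulariser are illustrative assumptions, not the thesis implementation; the consistency regulariser and the best-response self-play loop described above are omitted.

```python
# Illustrative sketch (not the thesis code): signal-conditioned team policies
# with a diversity regulariser, assuming a discrete action space.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical, kl_divergence


class SignalConditionedPolicy(nn.Module):
    """Team agent policy pi(a | obs, z), where z is a shared correlation signal
    drawn from a pre-agreed (here, uniform) correlation device at episode start."""

    def __init__(self, obs_dim, n_actions, n_signals, hidden=64):
        super().__init__()
        self.n_signals = n_signals
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_signals, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def dist(self, obs, signal):
        # One-hot encode the signal and condition the action distribution on it.
        z = F.one_hot(signal, self.n_signals).float()
        logits = self.net(torch.cat([obs, z], dim=-1))
        return Categorical(logits=logits)


def sample_signal(batch_size, n_signals):
    # Correlation device: a uniformly sampled signal shared by all team agents.
    return torch.randint(0, n_signals, (batch_size,))


def diversity_loss(policy, obs, n_signals):
    # One hypothetical diversity regulariser: reward large KL divergence between
    # the action distributions induced by distinct signals, so that different
    # signals lead to visibly different team behaviours.
    dists = [policy.dist(obs, torch.full((obs.shape[0],), z, dtype=torch.long))
             for z in range(n_signals)]
    loss = 0.0
    for i in range(n_signals):
        for j in range(i + 1, n_signals):
            loss = loss - kl_divergence(dists[i], dists[j]).mean()
    return loss


# Example rollout step for one team agent (shapes are illustrative):
obs = torch.randn(8, 10)            # batch of observations
z = sample_signal(8, n_signals=4)   # signal shared by the team for each episode
pi = SignalConditionedPolicy(obs_dim=10, n_actions=5, n_signals=4)
action = pi.dist(obs, z).sample()
```

In a full training loop, such a signal-conditioned policy would be optimised together with the consistency term and best-response updates against the opponent, following the self-play equilibrium learning procedure described above.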
Our experiments show that STeaM can learn team agent policies that approximate TMECor well. These policies consistently achieve higher rewards in adversarial and uncertain situations than policies produced by other state-of-the-art models. The STeaM-produced policies also exhibit bounded performance degradation when tested against previously unseen policies.