Learning expensive coordination: An event-based deep RL approach


Bibliographic Details
Main Authors: YU, Runsheng, WANG, Xinrun, WANG, Rundong, ZHANG, Youzhi, AN, Bo, SHI, Zhen Yu, LAI, Hanjiang
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2020
Online Access:https://ink.library.smu.edu.sg/sis_research/9147
https://ink.library.smu.edu.sg/context/sis_research/article/10150/viewcontent/108_learning_expensive_coordination_av.pdf
Institution: Singapore Management University
Description
Summary: Existing works in deep Multi-Agent Reinforcement Learning (MARL) mainly focus on coordinating cooperative agents to complete certain tasks jointly. However, in many real-world settings agents are self-interested, such as employees in a company or clubs in a league. The leader, i.e., the manager of the company or the league, therefore needs to provide bonuses to followers for efficient coordination, which we call expensive coordination. The main difficulties of expensive coordination are that (i) the leader has to consider the long-term effect and predict the followers' behaviors when assigning bonuses, and (ii) the complex interactions between followers make the training process hard to converge, especially when the leader's policy changes over time. In this work, we address this problem through an event-based deep RL approach. Our main contributions are threefold. (1) We model the leader's decision-making process as a semi-Markov Decision Process and propose a novel multi-agent event-based policy gradient to learn the leader's long-term policy. (2) We exploit the leader-follower consistency scheme to design a follower-aware module and a follower-specific attention module that predict the followers' behaviors and respond to them accurately. (3) We propose an action abstraction-based policy gradient algorithm to reduce the followers' decision space and thus accelerate the training process of followers. Experiments in resource collections, navigation, and the predator-prey game show that our approach dramatically outperforms the state-of-the-art methods.
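
The abstract does not detail the architecture, but a minimal sketch (assuming a PyTorch implementation) illustrates what a follower-specific attention module of this kind could look like: the leader's state queries per-follower feature embeddings, and the attention-weighted context would feed the leader's bonus policy head. All class names, dimensions, and the scaled dot-product formulation below are illustrative assumptions, not taken from the paper.

    import torch
    import torch.nn as nn

    class FollowerAttention(nn.Module):
        # Hypothetical follower-specific attention: the leader attends over
        # per-follower embeddings so that bonuses can respond to predicted
        # follower behaviors. Dimensions are illustrative only.
        def __init__(self, state_dim, follower_dim, hidden_dim=64):
            super().__init__()
            self.query = nn.Linear(state_dim, hidden_dim)     # leader state -> query
            self.key = nn.Linear(follower_dim, hidden_dim)    # follower features -> keys
            self.value = nn.Linear(follower_dim, hidden_dim)  # follower features -> values

        def forward(self, leader_state, follower_feats):
            # leader_state: (batch, state_dim)
            # follower_feats: (batch, n_followers, follower_dim)
            q = self.query(leader_state).unsqueeze(1)          # (batch, 1, hidden)
            k = self.key(follower_feats)                       # (batch, n, hidden)
            v = self.value(follower_feats)                     # (batch, n, hidden)
            scores = (q * k).sum(-1) / k.size(-1) ** 0.5       # scaled dot-product scores
            attn = torch.softmax(scores, dim=-1)               # (batch, n) follower weights
            context = (attn.unsqueeze(-1) * v).sum(dim=1)      # (batch, hidden) leader context
            return context, attn

    # Example usage: the context vector would be concatenated with the leader's
    # state and passed to its (event-based) policy network.
    mod = FollowerAttention(state_dim=16, follower_dim=8)
    ctx, weights = mod(torch.randn(4, 16), torch.randn(4, 3, 8))
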