Credit assignment in multiagent reinforcement learning for large agent population

Bibliographic Details
Main Author: SINGH, Arambam James
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2021
Online Access:https://ink.library.smu.edu.sg/etd_coll/364
https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1362&context=etd_coll
Institution: Singapore Management University
Description
Summary: In the current age, rapid growth in sectors such as finance and transportation involves the fast digitization of industrial processes. This creates a huge opportunity for next-generation artificial intelligence systems with multiple agents operating at scale. Multiagent reinforcement learning (MARL) is the field of study that addresses problems in multiagent systems. In this thesis, we develop and evaluate novel MARL methodologies that address the challenges of large-scale cooperative multiagent systems. One of the key challenges in cooperative MARL is the problem of credit assignment. Many previous approaches to this problem rely on each agent's individual trajectory, which limits scalability to a small number of agents. Our proposed methodologies are based solely on aggregate information, which provides the benefit of high scalability: the dimension of the key statistics does not change as the agent population grows. In this thesis we also address other challenges that arise in MARL, such as variable-duration actions, and present some preliminary work on credit assignment under a sparse reward model.

The first part of this thesis investigates the challenges in a maritime traffic management (MTM) problem, one of the motivating domains for large-scale cooperative multiagent systems. The key research question is how to coordinate vessels in a heavily trafficked maritime environment to increase the safety of navigation by reducing traffic congestion. The MTM problem is an instance of cooperative MARL with a shared reward: vessels share the same penalty cost for any congestion, so the problem suffers from credit assignment. We address it by developing a vessel-based value function using aggregate information, which performs effective credit assignment by computing the effectiveness of an agent's policy while filtering out the contributions of other agents.

Although this first approach achieved promising results, its ability to handle variable-duration actions, a crucial feature of the problem domain, is rather limited. We therefore address this challenge using hierarchical reinforcement learning, a framework for control with variable-duration actions. We develop a novel hierarchical learning based approach for the maritime traffic control problem, introducing the notion of a meta action, a high-level action that takes a variable amount of time to execute. We also propose an individual meta value function using aggregate information, which effectively addresses the credit assignment problem.

We also develop a general approach to the credit assignment problem for large-scale cooperative multiagent systems in both discrete and continuous action settings. We extend a shaped-reward approach known as difference rewards (DR) to address the credit assignment problem. DRs are an effective tool for this problem, but their computation is known to be challenging even for a small number of agents. We propose a scalable method to compute difference rewards based on aggregate information. One limitation of this DR-based approach is that it relies on learning a good approximation of the reward model. In a sparse reward setting, however, agents receive no informative immediate reward signal until the episode ends, so this shaped-reward approach is not effective in the sparse-reward case. In this thesis, we also propose some preliminary work in this direction.
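To make the difference rewards idea concrete: the standard definition is D_i(z) = G(z) - G(z_{-i}), i.e., agent i's shaped reward is the global reward minus the global reward with agent i's contribution removed. The Python sketch below illustrates why aggregate information helps: when the global reward depends only on the count of agents per zone, the counterfactual G(z_{-i}) is a unit decrement of a single count rather than a re-simulation. The congestion_penalty reward model, the zone/count representation, and all names here are illustrative assumptions, not the formulation used in the thesis.

import numpy as np

def congestion_penalty(counts, capacity):
    # Global reward: penalize each zone by how far its agent count
    # exceeds capacity. An assumed stand-in for the true reward model.
    return -np.sum(np.maximum(counts - capacity, 0.0))

def difference_rewards(counts, agent_zones, capacity):
    # D_i = G(z) - G(z_{-i}). Since G depends only on aggregate counts,
    # removing agent i is a unit decrement of counts[zone_of_i].
    g = congestion_penalty(counts, capacity)
    d = np.empty(len(agent_zones))
    for i, z in enumerate(agent_zones):
        counts[z] -= 1.0                  # counterfactual: agent i absent
        d[i] = g - congestion_penalty(counts, capacity)
        counts[z] += 1.0                  # restore the true counts
    return d

# Example: 5 agents in 3 zones, capacity 1 per zone.
zones = np.array([0, 0, 1, 2, 2])
counts = np.bincount(zones, minlength=3).astype(float)
print(difference_rewards(counts, zones, capacity=1.0))  # [-1. -1.  0. -1. -1.]

In this toy run, agents in over-capacity zones receive a negative difference reward (their presence adds penalty), while the lone agent in zone 1 receives zero, isolating each agent's marginal contribution without tracking per-agent trajectories.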