Towards efficient cooperation within learning agents

Bibliographic Details
Main Author: Wang, Rundong
Other Authors: An, Bo
Format: Thesis (Doctor of Philosophy)
Language: English
Published: Nanyang Technological University, 2023
Online Access:https://hdl.handle.net/10356/169921
Institution: Nanyang Technological University
Summary: A wide range of real-world problems, such as the control of autonomous vehicles and drones, packet delivery, and many others, consist of a number of agents that need to take actions based on local observations, and can thus be formulated in the multi-agent reinforcement learning (MARL) setting. Furthermore, as more machine learning systems are deployed in the real world, they will start having an impact on each other, effectively turning most decision-making problems into multi-agent cooperation problems. In this doctoral thesis, we develop and evaluate novel deep reinforcement learning (DRL) methods that address the unique challenges arising in these settings: learning to communicate, to collaborate, and to reciprocate amongst agents.

In the first part of the doctoral thesis, we consider the problem of limited-bandwidth communication for multi-agent reinforcement learning, where agents cooperate with the assistance of a communication protocol. A key difficulty faced by a group of learning agents in real-world domains is the need to efficiently exploit the available communication resources, such as limited bandwidth. To address the limited-bandwidth problem, we develop an Informative Multi-Agent Communication (IMAC) method that learns efficient communication protocols by compressing the communication messages. From the perspective of communication theory, we prove that the limited-bandwidth constraint requires low-entropy messages throughout the transmission. In IMAC, inspired by the information bottleneck principle, agents are trained to learn a valuable and compact communication protocol.

The second part of the doctoral thesis investigates the challenges of hierarchical reinforcement learning (HRL), which is often implemented as a high-level policy assigning subgoals to a low-level policy. HRL suffers from a high-level non-stationarity problem, since the low-level policy is constantly changing. The non-stationarity in turn causes a data-efficiency problem: policies need more data at non-stationary states to stabilize training. To address these issues, we propose a novel HRL method, Interactive Influence-based Hierarchical Reinforcement Learning (I2HRL). In I2HRL, we enable interaction between the low-level and high-level policies: the low-level policy sends its policy representation to the high-level policy. The key insight is that "hierarchy" is just a way of assigning responsibilities in a complex system, so HRL is really about "collaboration" among multiple agents and can be interpreted as a form of MARL once we treat each level's policy as an agent, with the state transition function and the reward function of each agent depending on the actions of all agents.

Also in this part, we consider a specific problem from Fintech: portfolio management via reinforcement learning, which explores how to optimally reallocate a fund into different financial assets over the long term by trial and error. Existing methods are impractical because they usually assume that each reallocation can be finished immediately, and they thus ignore price slippage as part of the trading cost. To address these issues, we propose a hierarchical reinforced stock trading system for portfolio management (HRPM). Concretely, we decompose the trading process into a hierarchy of portfolio management over trade execution and train the corresponding policies: the high-level policy gives portfolio weights at a lower frequency to maximize long-term profit, and invokes the low-level policy to sell or buy the corresponding shares within a short time window at a higher frequency to minimize the trading cost. We train the two levels of policies via a pre-training scheme and an iterative training scheme for data efficiency.
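
To make this decomposition concrete, the following is a minimal, self-contained sketch of such a two-level trading loop. It is only an illustration under simplifying assumptions (random stand-in policies, toy price dynamics, and a TWAP-like execution schedule), not the HRPM implementation from the thesis; all names here (HighLevelPolicy, LowLevelPolicy, window) are hypothetical.

```python
# Sketch of a portfolio-management-over-trade-execution hierarchy.
# High level: target portfolio weights, chosen once per window.
# Low level: small per-step orders that realize those weights gradually.
import numpy as np

rng = np.random.default_rng(0)

class HighLevelPolicy:
    """Proposes target portfolio weights at a low frequency."""
    def act(self, prices: np.ndarray) -> np.ndarray:
        w = rng.random(len(prices))   # stand-in for a learned network
        return w / w.sum()            # weights sum to 1

class LowLevelPolicy:
    """Executes the remaining rebalancing gradually to reduce slippage."""
    def act(self, remaining: np.ndarray, steps_left: int) -> np.ndarray:
        return remaining / steps_left  # TWAP-like schedule: equal slices

n_assets, window, horizon = 4, 10, 30
prices = np.ones(n_assets)
holdings = np.full(n_assets, 1.0 / n_assets)  # current portfolio weights
high, low = HighLevelPolicy(), LowLevelPolicy()

for t in range(horizon):
    if t % window == 0:                      # low-frequency decision
        target = high.act(prices)
    steps_left = window - (t % window)
    order = low.act(target - holdings, steps_left)  # high-frequency order
    holdings += order                        # reaches target by window end
    prices *= np.exp(rng.normal(0.0, 0.01, size=n_assets))  # toy dynamics
```

Even in this stub the division of labor is visible: the high-level decision changes only once per window, while the low-level policy acts at every step, spreading the rebalancing over many small orders to limit the trading cost.
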
Extensive experimental results in the U.S. market and the China market demonstrate that HRPM achieves significant improvements over many state-of-the-art approaches.

In the third part of the doctoral thesis, we consider the problem of automatic curriculum learning in MARL, where a teacher and a student learn from and reciprocate each other. Recent advances in MARL allow agents to coordinate their behaviors in complex environments. However, common MARL algorithms still suffer from scalability and sparse-reward issues. One promising approach to resolving them is automatic curriculum learning (ACL), in which a student (curriculum learner) trains on tasks of increasing difficulty controlled by a teacher (curriculum generator). Despite its success, ACL's applicability is limited by (1) the lack of a general student framework for dealing with the varying number of agents across tasks and with the sparse-reward problem, and (2) the non-stationarity of the teacher's task due to ever-changing student strategies. As a remedy, we introduce a novel automatic curriculum learning framework, Skilled Population Curriculum (SPC), which adapts curriculum learning to multi-agent coordination. Specifically, we endow the student with population-invariant communication and a hierarchical skill set, allowing it to learn cooperation and behavior skills from distinct tasks with varying numbers of agents. In addition, we model the teacher as a contextual bandit conditioned on student policies, enabling a team of agents to change its size while still retaining previously acquired skills. We also analyze the inherent non-stationarity of this multi-agent automatic curriculum teaching problem and provide a corresponding regret bound. Empirical results show that our method improves performance, scalability, and sample efficiency in several MARL environments.

To conclude, this thesis makes progress on the challenges that arise in hierarchical and multi-agent settings, and it also opens up a number of exciting questions for future research: how agents can learn to account for the learning of other agents when their rewards or observations are unknown, how to learn communication protocols in settings of partial common interest, and how to account for the agency of humans in the environment.