Scalable multi-agent reinforcement learning for aggregation systems
Main Author: | |
---|---|
Format: | text |
Language: | English |
Published: | Institutional Knowledge at Singapore Management University, 2020 |
Subjects: | |
Online Access: | https://ink.library.smu.edu.sg/etd_coll/279 https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1279&context=etd_coll |
Institution: | Singapore Management University |
Summary: | Efficient sequential matching of supply and demand is a problem of interest in many online-to-offline services. Examples include Uber, Lyft, and Grab, which match taxis to customers, and Ubereats, Deliveroo, and FoodPanda, which match restaurants to customers. In these systems, a centralized entity (e.g., Uber) aggregates supply and assigns it to demand so as to optimize a central metric such as profit, number of requests, or delay. However, individuals (e.g., drivers, delivery boys) in the system are self-interested and try to maximize their own long-term profit. The central entity has a full view of the system, so it can learn policies that maximize the overall payoff and suggest them to the individuals. However, because the individuals are self-interested, they might not follow these suggestions. Hence, in my thesis, I develop approaches that learn to guide these individuals such that their long-term revenue is maximized. Three key characteristics distinguish aggregation systems from other multi-agent systems. First, thousands or tens of thousands of individuals are present in the system. Second, the outcome of an interaction is anonymous, i.e., it depends only on the number of agents involved and not on their identities. Third, a centralized entity is present that has a full view of the system, but its objective does not align with the objectives of the individuals. These characteristics make existing Multi-Agent Reinforcement Learning (MARL) methods challenging to apply, as they are either meant for just a few agents or assume some prior belief about others. A natural question is whether individuals can exploit these features to learn efficient policies that maximize their own long-term payoffs. My thesis research focuses on answering this question and provides scalable reinforcement learning methods for aggregation systems.
Utilizing a centralized entity for decentralized learning in a non-cooperative setting is not new, and existing MARL methods can be classified by how much extra information about the environment state and joint action is provided to the individual learners. However, the presence of a self-interested centralized entity adds a new dimension to the learning problem. In an aggregation system, the centralized entity can learn from the overall experiences of the individuals and might want to reveal only the information that helps achieve its own objective. Therefore, in my work I propose approaches that consider multiple combinations of the level of information sharing and the level of learning done by the centralized entity. My first contribution assumes that the individuals receive no extra information and learn from their local observations. It is a fully decentralized learning method in which independent agents learn from offline trajectories under the assumption that others follow stationary policies. In my next work, the individuals exploit the anonymity of the domain and use the number of other agents present in their local observation to improve their learning. Increasing the level of learning done by the centralized entity, my next contribution is an equilibrium learning method in which the centralized entity suggests a variance minimization policy learned from the action values estimated by the individuals. Further increasing both the level of information shared and the level of learning done by the centralized entity, I next provide a learning method in which the centralized entity acts as a correlation agent. In this method, the centralized entity learns a social welfare maximization policy directly from the experiences of the individuals and suggests it to the individual agents.
The individuals in turn learn a best-response policy to the suggested social welfare maximization policy. In my last contribution, I propose an incentive-based learning approach in which the central agent provides incentives to the individuals such that their learning converges to a policy that maximizes overall system performance. Experimental results on real-world data sets and multiple synthetic data sets demonstrate that these approaches outperform other state-of-the-art approaches in terms of both individual payoffs and the overall social welfare payoff of the system. |
---|
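
The anonymity property mentioned in the summary (outcomes depend only on how many agents take an action, not on which agents they are) can be illustrated with a small sketch. This is not the thesis's method: the function name `anonymous_payoffs`, the zone labels, and the rule that agents in a zone share its demand equally are all assumptions made for illustration.

```python
from collections import Counter


def anonymous_payoffs(actions, demand):
    """Expected payoff per agent under a hypothetical equal-sharing rule.

    actions: list where actions[i] is the zone chosen by agent i
    demand:  dict mapping zone -> number of customer requests in that zone
    """
    counts = Counter(actions)  # number of agents that chose each zone
    payoffs = []
    for zone in actions:
        # Requests that can be served are capped by the number of agents present.
        served = min(demand.get(zone, 0), counts[zone])
        # Assumption: served requests are split evenly among agents in the zone.
        payoffs.append(served / counts[zone])
    return payoffs


demand = {"A": 2, "B": 1}
original = ["A", "A", "A", "B"]  # agents 0-2 choose zone A, agent 3 chooses B
swapped = ["B", "A", "A", "A"]   # agents 0 and 3 trade choices

payoffs_original = anonymous_payoffs(original, demand)
payoffs_swapped = anonymous_payoffs(swapped, demand)

# The per-zone counts are identical, so the multiset of payoffs is unchanged;
# only which agent receives which payoff differs.
same_outcome = sorted(payoffs_original) == sorted(payoffs_swapped)
```

Because the payoff is a function of the zone counts alone, a learner in such a setting could in principle condition on counts rather than on the identities of thousands of other agents, which is the kind of structure the summary says the individuals exploit.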