Recommendation via reinforcement learning methods
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis-Doctor of Philosophy |
Language: | English |
Published: | Nanyang Technological University, 2021 |
Subjects: | |
Online Access: | https://hdl.handle.net/10356/152271 |
Institution: | Nanyang Technological University |
Summary: | Recommender systems, which aim to recommend suitable items such as movies to users, have been a persistent research topic for decades. Supervised learning methods are widely adopted, modeling recommendation as a prediction task. However, with the rise of online e-commerce platforms, many scenarios have emerged in which users make sequential decisions rather than one-time decisions. Reinforcement learning methods have therefore attracted increasing attention in recent years as a way to address these problems.
This doctoral thesis investigates recommendation settings that can be solved by reinforcement learning methods, including multi-armed bandits and multi-agent reinforcement learning.
In the recommendation domain, most scenarios involve only a single agent that recommends items to users with the aim of maximizing a metric such as click-through rate (CTR). Because candidate items change constantly in many online recommendation scenarios, one crucial issue is the trade-off between exploration and exploitation. We therefore consider multi-armed bandit problems, a topic in online learning and reinforcement learning concerned with balancing exploration and exploitation, and propose two methods to alleviate issues in recommendation problems.
First, we consider how users give feedback on the items or actions chosen by an agent. Previous work rarely considers the uncertainty in human feedback, especially when the optimal action is not obvious to the user. For example, when similar items are recommended, a user may give positive feedback on suboptimal items, negative feedback on the optimal item, or no feedback at all in confusing situations. To capture uncertainty in both the learning environment and human feedback, we introduce a feedback model, and we propose a novel method that finds the optimal policy and a proper feedback model simultaneously.
Second, for online recommendation on mobile devices, item positions have a significant influence on clicks because of the limited screen size: 1) higher positions lead to more clicks for the same commodity; 2) the "pseudo-exposure" issue: only a few recommended items are shown at first glance, and users need to slide the screen to browse the rest. As a result, some lower-ranked items are never viewed by users, and it is not proper to treat them as negative samples. To address these two issues, we model online recommendation as a contextual combinatorial bandit problem and define the reward of a recommended set. We then propose a novel contextual combinatorial bandit method and provide a formal regret analysis. An online experiment was conducted on Taobao, one of the most popular e-commerce platforms in the world. Results on two metrics show that our algorithm outperforms the other contextual bandit algorithms.
For the multi-agent reinforcement learning setting, we focus on a recommendation scenario in online e-commerce platforms in which multiple modules recommend items with different properties, such as huge discounts. A web page often consists of several independent modules. The ranking policies of these modules are decided by different teams and optimized individually without cooperation, which can result in competition between modules, so the global policy of the whole page may be sub-optimal. To address this issue, we propose a novel multi-agent cooperative reinforcement learning approach under the restriction that modules cannot communicate with each other. Experiments on real-world e-commerce data demonstrate that our algorithm achieves superior performance over baselines. |
---|
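
The exploration-exploitation trade-off mentioned in the summary can be illustrated with a small sketch. The code below is not the thesis's method: it is the textbook UCB1 rule for a non-contextual multi-armed bandit, run on simulated Bernoulli click feedback. The function name `ucb1` and the click probabilities are hypothetical and chosen only for illustration.

```python
import math
import random


def ucb1(click_probs, rounds, seed=0):
    """Run UCB1 on simulated click feedback and return how often each item was shown.

    click_probs are assumed (hypothetical) per-item click-through rates that the
    learner does not observe directly; it only sees a 0/1 click per recommendation.
    """
    rng = random.Random(seed)
    n_items = len(click_probs)
    pulls = [0] * n_items      # number of times each item was recommended
    clicks = [0.0] * n_items   # total clicks received by each item

    for t in range(1, rounds + 1):
        if t <= n_items:
            # Initial exploration: recommend every item once.
            item = t - 1
        else:
            # Exploitation term (empirical CTR) plus exploration bonus that
            # shrinks as an item is recommended more often.
            item = max(
                range(n_items),
                key=lambda i: clicks[i] / pulls[i]
                + math.sqrt(2.0 * math.log(t) / pulls[i]),
            )
        reward = 1.0 if rng.random() < click_probs[item] else 0.0
        pulls[item] += 1
        clicks[item] += reward

    return pulls


if __name__ == "__main__":
    # Three items with assumed CTRs; the last one is the best.
    print(ucb1([0.1, 0.5, 0.9], rounds=2000))
```

In this sketch, items with few impressions keep a large bonus and are still tried occasionally, while most recommendations go to the item with the highest observed CTR. The contextual combinatorial setting described in the summary additionally uses item features and recommends a set of items per round, which this sketch does not cover.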