Value Penalized Q-Learning for Recommender Systems
- URL: http://arxiv.org/abs/2110.07923v1
- Date: Fri, 15 Oct 2021 08:08:28 GMT
- Title: Value Penalized Q-Learning for Recommender Systems
- Authors: Chengqian Gao, Ke Xu, Peilin Zhao
- Abstract summary: Scaling reinforcement learning to recommender systems (RS) is promising since maximizing the expected cumulative rewards for RL agents meets the objective of RS.
A key approach to this goal is offline RL, which aims to learn policies from logged data.
We propose Value Penalized Q-learning (VPQ), an uncertainty-based offline RL algorithm.
- Score: 30.704083806571074
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling reinforcement learning (RL) to recommender systems (RS) is promising
since maximizing the expected cumulative rewards for RL agents meets the
objective of RS, i.e., improving customers' long-term satisfaction. A key
approach to this goal is offline RL, which aims to learn policies from logged
data. However, the high-dimensional action space and the non-stationary
dynamics in commercial RS intensify distributional shift issues, making it
challenging to apply offline RL methods to RS. To alleviate the action
distribution shift problem in extracting RL policy from static trajectories, we
propose Value Penalized Q-learning (VPQ), an uncertainty-based offline RL
algorithm. It penalizes the unstable Q-values in the regression target by
uncertainty-aware weights, without the need to estimate the behavior policy,
suitable for RS with a large number of items. We derive the penalty weights
from the variances across an ensemble of Q-functions. To alleviate
distributional shift issues at test time, we further introduce the critic
framework to integrate the proposed method with classic RS models. Extensive
experiments conducted on two real-world datasets show that the proposed method
could serve as a gain plugin for existing RS models.
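To make the target construction concrete, here is a minimal NumPy sketch of an ensemble-variance-penalized regression target. The paper derives its penalty weights from variances across an ensemble of Q-functions; the exact functional form below (subtracting a scaled standard deviation) and all names (`penalized_target`, `lam`) are illustrative assumptions, not the paper's verbatim formulation.
```python
import numpy as np

def penalized_target(q_ensemble, rewards, next_states, next_actions,
                     gamma=0.99, lam=1.0):
    """Build an uncertainty-penalized Q-learning target.

    q_ensemble: list of callables, each mapping (states, actions) -> Q-values.
    The penalty subtracts a multiple of the ensemble standard deviation,
    so targets backed by disagreeing Q-functions are pushed down.
    """
    # Evaluate every ensemble member at the next state-action pair.
    next_qs = np.stack([q(next_states, next_actions) for q in q_ensemble])  # (K, B)
    mean_q = next_qs.mean(axis=0)   # ensemble consensus
    std_q = next_qs.std(axis=0)     # disagreement, read as epistemic uncertainty
    # Penalize unstable Q-values: the larger the disagreement, the lower the target.
    return rewards + gamma * (mean_q - lam * std_q)

# Toy usage: five random linear "Q-functions" over a batch of 4 transitions.
rng = np.random.default_rng(0)
qs = [lambda s, a, w=rng.normal(size=3): s @ w for _ in range(5)]  # a is unused here
s_next = rng.normal(size=(4, 3))
r = np.ones(4)
print(penalized_target(qs, r, s_next, None))
```
Note that nothing here requires estimating the behavior policy, which is the property the abstract highlights for large item catalogs.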
Related papers
- Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales [13.818149654692863]
Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance.
In this work, we improve the stability of RL training by adapting the reverse cross entropy (RCE) from supervised learning for noisy data to define a symmetric RL loss.
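A minimal sketch of the symmetric cross-entropy construction the summary points to, combining the forward CE with a reverse term whose log(0) is clipped to a constant, as in noisy-label learning; how this loss is wired into the RL objective is an assumption, and all names are illustrative.
```python
import numpy as np

def symmetric_ce(pred_probs, target_probs, alpha=1.0, beta=1.0, log_clip=-4.0):
    """Symmetric cross entropy: CE(target, pred) + reverse CE(pred, target).

    The reverse term treats the (possibly noisy) target as the prediction,
    clipping the log of zero-probability targets to a finite constant.
    """
    eps = 1e-12
    ce = -(target_probs * np.log(pred_probs + eps)).sum(axis=-1)
    # Reverse CE: roles swapped; clip log of hard/zero targets to a constant.
    log_t = np.where(target_probs > 0, np.log(target_probs + eps), log_clip)
    rce = -(pred_probs * log_t).sum(axis=-1)
    return alpha * ce + beta * rce

pred = np.array([[0.7, 0.2, 0.1]])
target = np.array([[1.0, 0.0, 0.0]])   # a hard, possibly noisy label
print(symmetric_ce(pred, target))
```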
arXiv Detail & Related papers (2024-05-27T19:28:33Z)
- Retentive Decision Transformer with Adaptive Masking for Reinforcement Learning based Recommendation Systems [17.750449033873036]
Reinforcement Learning-based Recommender Systems (RLRS) have shown promise across a spectrum of applications.
Yet, they grapple with challenges, notably in crafting reward functions and harnessing large pre-existing datasets.
Recent advances in offline RLRS offer a way to address these two challenges.
arXiv Detail & Related papers (2024-03-26T12:08:58Z)
- A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning [54.48409201256968]
Offline-to-online Reinforcement Learning (O2O RL) aims to improve the performance of an offline pretrained policy using only a few online samples.
Most O2O methods focus on the balance between RL objective and pessimism, or the utilization of offline and online samples.
arXiv Detail & Related papers (2023-12-12T19:24:35Z)
- FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations [52.85536740465277]
FIRE is a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment.
We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function.
We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
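The summary does not give the estimator; the sketch below shows one generic way to sample transitions in proportion to an impact score while correcting the bias with importance weights. The impact proxy and the weighting scheme are assumptions, not ImRE's actual algorithm.
```python
import numpy as np

def importance_sampled_q_update(Q, transitions, impact, alpha=0.1, gamma=0.99):
    """One Q-learning sweep where transitions are drawn in proportion to
    an 'impact' score, with importance weights correcting the bias."""
    rng = np.random.default_rng(0)
    p = impact / impact.sum()           # sampling distribution over transitions
    uniform = 1.0 / len(transitions)    # the nominal (uniform) distribution
    idx = rng.choice(len(transitions), size=len(transitions), p=p)
    for i in idx:
        s, a, r, s_next = transitions[i]
        w = uniform / p[i]              # importance weight
        td = r + gamma * Q[s_next].max() - Q[s, a]
        Q[s, a] += alpha * w * td
    return Q

Q = np.zeros((3, 2))
trans = [(0, 1, 0.0, 1), (1, 0, 0.0, 2), (1, 1, 10.0, 2)]  # rare, high-reward event last
impact = np.array([1.0, 1.0, 8.0])
print(importance_sampled_q_update(Q, trans, impact))
```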
arXiv Detail & Related papers (2022-09-28T19:49:39Z)
- Supervised Advantage Actor-Critic for Recommender Systems [76.7066594130961]
We propose a negative sampling strategy for training the RL component and combine it with supervised sequential learning.
Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case.
We instantiate the resulting methods, SNQN and SA2C, with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets.
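A one-function sketch of that advantage estimate, with the "average case" taken as the mean Q-value over the sampled negatives; the names are illustrative, not the paper's code.
```python
import numpy as np

def advantage_over_negatives(q_values, positive_item, negative_items):
    """Advantage of the observed (positive) item over the average case,
    where the 'average case' is estimated from sampled negative items."""
    baseline = np.mean([q_values[i] for i in negative_items])  # average case
    return q_values[positive_item] - baseline

q = np.array([0.2, 1.5, 0.4, 0.1])   # Q-values over a small item catalog
print(advantage_over_negatives(q, positive_item=1, negative_items=[0, 2, 3]))
```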
arXiv Detail & Related papers (2021-11-05T12:51:15Z)
- BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions.
We improve behavior-regularized offline reinforcement learning and propose BRAC+.
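The summary leaves the regularizer implicit; as background, in the BRAC family the actor objective typically trades the critic value off against a divergence to the behavior policy. A Gaussian-policy sketch with an analytic KL follows, with all parameters illustrative.
```python
import numpy as np

def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between two diagonal Gaussians."""
    return (np.log(std_q / std_p)
            + (std_p**2 + (mu_p - mu_q)**2) / (2 * std_q**2) - 0.5).sum()

def behavior_regularized_actor_loss(q_value, pi_mu, pi_std, beh_mu, beh_std,
                                    alpha=0.1):
    """Maximize Q while staying close to the (estimated) behavior policy:
    loss = -Q(s, a~pi) + alpha * KL(pi || pi_behavior)."""
    return -q_value + alpha * gaussian_kl(pi_mu, pi_std, beh_mu, beh_std)

print(behavior_regularized_actor_loss(
    q_value=1.2,
    pi_mu=np.array([0.5]), pi_std=np.array([0.3]),
    beh_mu=np.array([0.0]), beh_std=np.array([0.5]),
))
```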
arXiv Detail & Related papers (2021-10-02T23:55:49Z)
- Critic Regularized Regression [70.8487887738354]
We propose a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR).
We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces.
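Read loosely, CRR trains the policy by advantage-weighted behavior cloning; the binary and exponential weightings below follow common CRR formulations and are assumptions about this paper's exact variant.
```python
import numpy as np

def crr_weights(advantages, mode="exp", beta=1.0):
    """Per-sample weights for critic-regularized regression:
    clone logged actions, but only (or mostly) where the critic approves."""
    if mode == "binary":
        return (advantages > 0).astype(float)            # keep only improving actions
    return np.minimum(np.exp(advantages / beta), 20.0)   # soft, clipped weighting

def crr_policy_loss(log_probs, advantages):
    """Weighted negative log-likelihood of the logged actions."""
    return -(crr_weights(advantages) * log_probs).mean()

log_probs = np.log(np.array([0.6, 0.1, 0.3]))   # pi(a_logged | s) for a batch
adv = np.array([0.5, -1.0, 2.0])
print(crr_policy_loss(log_probs, adv))
```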
arXiv Detail & Related papers (2020-06-26T17:50:26Z)
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
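As background, the conservatism comes from a regularizer that pushes Q-values down on out-of-distribution actions and up on logged ones; the discrete-action logsumexp variant sketched below is one common instantiation, hedged as an assumption about the exact form used here.
```python
import numpy as np

def cql_penalty(q_all_actions, logged_actions, alpha=1.0):
    """Conservative penalty: logsumexp over all actions minus the Q-value
    of the action actually seen in the data, averaged over the batch."""
    lse = np.log(np.exp(q_all_actions).sum(axis=1))       # soft maximum over actions
    q_data = q_all_actions[np.arange(len(logged_actions)), logged_actions]
    return alpha * (lse - q_data).mean()

q = np.array([[1.0, 3.0, 0.5],
              [0.2, 0.1, 2.0]])        # Q(s, a) for 2 states x 3 actions
a_logged = np.array([1, 2])
print(cql_penalty(q, a_logged))        # added to the usual Bellman error loss
```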
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by running them on rewards artificially penalized by the uncertainty of the dynamics.
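A sketch of that reward modification, with ensemble disagreement standing in for the dynamics uncertainty u(s, a); the specific uncertainty estimator is an assumption, since the theory only requires an admissible error bound.
```python
import numpy as np

def penalized_rewards(rewards, model_predictions, lam=1.0):
    """MOPO-style reward: r_tilde = r - lam * u(s, a), with u estimated here
    as the spread of next-state predictions across a dynamics-model ensemble."""
    preds = np.stack(model_predictions)             # (K, B, state_dim)
    u = np.linalg.norm(preds.std(axis=0), axis=-1)  # per-sample disagreement
    return rewards - lam * u

rng = np.random.default_rng(1)
preds = [rng.normal(size=(4, 3)) for _ in range(5)]  # 5 models, batch of 4
r = np.ones(4)
print(penalized_rewards(r, preds))
```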
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
- Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors [13.534873779043478]
We present a distributional soft actor-critic (DSAC) algorithm to improve the policy performance by mitigating Q-value overestimations.
We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
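The key idea is to learn a return distribution rather than a point estimate; the quantile-regression sketch below is a deliberately simplified stand-in for a distributional critic, not DSAC's actual continuous parameterization.
```python
import numpy as np

def quantile_regression_loss(pred_quantiles, target_samples):
    """Pinball loss for learning a return distribution: each predicted
    quantile is pulled toward the target samples with an asymmetric weight."""
    taus = (np.arange(len(pred_quantiles)) + 0.5) / len(pred_quantiles)
    diff = target_samples[None, :] - pred_quantiles[:, None]   # (Q, T)
    return np.mean(np.abs(taus[:, None] - (diff < 0)) * np.abs(diff))

pred = np.array([0.0, 0.5, 1.0, 1.5])   # 4 quantiles of the return estimate
targets = np.array([0.8, 1.1, 0.9])     # sampled Bellman targets
print(quantile_regression_loss(pred, targets))
```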
arXiv Detail & Related papers (2020-01-09T02:27:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.