Improving Long-Term Metrics in Recommendation Systems using
Short-Horizon Offline RL
- URL: http://arxiv.org/abs/2106.00589v1
- Date: Tue, 1 Jun 2021 15:58:05 GMT
- Title: Improving Long-Term Metrics in Recommendation Systems using
Short-Horizon Offline RL
- Authors: Bogdan Mazoure, Paul Mineiro, Pavithra Srinath, Reza Sharifi Sedeh,
Doina Precup, Adith Swaminathan
- Abstract summary: We study session-based recommendation scenarios where we want to recommend items to users during sequential interactions to improve their long-term utility.
We develop a new batch RL algorithm called Short Horizon Policy Improvement (SHPI) that approximates policy-induced distribution shifts across sessions.
- Score: 56.20835219296896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study session-based recommendation scenarios where we want to recommend
items to users during sequential interactions to improve their long-term
utility. Optimizing a long-term metric is challenging because the learning
signal (whether the recommendations achieved their desired goals) is delayed
and confounded by other user interactions with the system. Immediately
measurable proxies such as clicks can lead to suboptimal recommendations due to
misalignment with the long-term metric. Many works have applied episodic
reinforcement learning (RL) techniques for session-based recommendation but
these methods do not account for policy-induced drift in user intent across
sessions. We develop a new batch RL algorithm called Short Horizon Policy
Improvement (SHPI) that approximates policy-induced distribution shifts across
sessions. By varying the horizon hyper-parameter in SHPI, we recover well-known
policy improvement schemes in the RL literature. Empirical results on four
recommendation tasks show that SHPI can outperform matrix factorization,
offline bandits, and offline RL baselines. We also provide a stable and
computationally efficient implementation using weighted regression oracles.
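
To make the abstract's high-level description concrete, here is a minimal sketch of the short-horizon idea as stated above: compute an h-step truncated return for every logged decision and hand it to an off-the-shelf weighted-classification oracle. The helper names, the use of scikit-learn's LogisticRegression as the regression oracle, and the data layout are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def short_horizon_returns(rewards, horizon, gamma=1.0):
    """h-step truncated (discounted) return at every step of one logged session."""
    T = len(rewards)
    returns = np.zeros(T)
    for t in range(T):
        window = rewards[t:t + horizon]
        returns[t] = sum(gamma ** k * r for k, r in enumerate(window))
    return returns

def shpi_like_update(contexts, actions, rewards_per_session, horizon):
    """Fit a policy with a weighted-classification oracle.

    NOTE: illustrative sketch only; SHPI's actual estimator and oracle may differ.

    contexts : (N, d) array of session/user features at each decision point
    actions  : (N,) array of recommended item ids (treated as class labels)
    rewards_per_session : list of per-session reward sequences, concatenating to N steps
    horizon  : the short-horizon hyper-parameter
    """
    weights = np.concatenate(
        [short_horizon_returns(np.asarray(r), horizon) for r in rewards_per_session]
    )
    # Weighted multinomial regression as the oracle: decision points followed by
    # higher short-horizon reward pull the policy toward their logged action.
    oracle = LogisticRegression(max_iter=1000)
    oracle.fit(contexts, actions, sample_weight=np.maximum(weights, 1e-6))
    return oracle  # oracle.predict_proba(x) acts as the improved stochastic policy
```

Consistent with the abstract's remark about the horizon hyper-parameter, setting horizon=1 makes this sketch behave like an offline contextual-bandit update, while letting the window cover the whole session moves it toward episodic RL-style returns.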
Related papers
- An Efficient Continuous Control Perspective for Reinforcement-Learning-based Sequential Recommendation [14.506332665769746]
We propose an Efficient Continuous Control framework (ECoC).
Based on a statistically tested assumption, we first propose a novel unified action representation abstracted from normalized user and item spaces.
During this process, strategic exploration and directional control in terms of unified actions are carefully designed and are crucial to the final recommendation decisions.
arXiv Detail & Related papers (2024-08-15T09:26:26Z) - Learning Goal-Conditioned Policies from Sub-Optimal Offline Data via Metric Learning [22.174803826742963]
We address the problem of learning optimal behavior from sub-optimal datasets for goal-conditioned offline reinforcement learning.
We propose the use of metric learning to approximate the optimal value function for goal-conditioned offline RL problems.
We show that our method estimates optimal behaviors from severely sub-optimal offline datasets without suffering from out-of-distribution estimation errors.
arXiv Detail & Related papers (2024-02-16T16:46:53Z) - Action-Quantized Offline Reinforcement Learning for Robotic Skill
Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe for converting static behavior datasets into policies that can perform better than the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
arXiv Detail & Related papers (2023-10-18T06:07:10Z) - AdaRec: Adaptive Sequential Recommendation for Reinforcing Long-term
User Engagement [25.18963930580529]
We introduce a novel paradigm called Adaptive Sequential Recommendation (AdaRec) to address the problem of reinforcing long-term user engagement.
AdaRec proposes a new distance-based representation loss to extract latent information from users' interaction trajectories.
We conduct extensive empirical analyses in both simulator-based and live sequential recommendation tasks.
arXiv Detail & Related papers (2023-10-06T02:45:21Z) - Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z) - Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between the learned policy and the dataset it was trained on.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
arXiv Detail & Related papers (2022-10-17T16:34:01Z) - ResAct: Reinforcing Long-term Engagement in Sequential Recommendation
with Residual Actor [36.0251263322305]
ResAct seeks a policy that is close to, but better than, the online-serving policy.
We conduct experiments on a benchmark dataset and a large-scale industrial dataset.
Results show that our method significantly outperforms the state-of-the-art baselines in various long-term engagement optimization tasks.
arXiv Detail & Related papers (2022-06-01T02:45:35Z) - OptiDICE: Offline Policy Optimization via Stationary Distribution
Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z) - Critic Regularized Regression [70.8487887738354]
We propose a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR); a schematic sketch of this weighted-regression idea follows this list.
We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces.
arXiv Detail & Related papers (2020-06-26T17:50:26Z)
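
The CRR entry above and the weighted regression oracles mentioned in the main abstract share the same reduction: policy improvement as weighted supervised learning. Below is a minimal, hypothetical PyTorch-style sketch of a CRR-flavoured loss with exponentiated-advantage weights; the temperature beta, the clipping value, and the tensor layout are assumptions for illustration rather than the paper's reference code.

```python
import torch
import torch.nn.functional as F

def crr_style_policy_loss(policy_logits, q_values, actions, beta=1.0, clip=20.0):
    """Weighted behavioural cloning: log-likelihood of logged actions,
    weighted by exp(advantage / beta) computed from a learned critic.

    Schematic sketch only; hyper-parameters and network interfaces are assumed.

    policy_logits : (B, A) unnormalised action scores from the policy network
    q_values      : (B, A) critic estimates Q(s, a) for all actions
    actions       : (B,)   logged action indices from the offline dataset
    """
    with torch.no_grad():
        # Advantage of the logged action against the policy's expected value.
        probs = F.softmax(policy_logits, dim=-1)
        v = (probs * q_values).sum(dim=-1)                      # V(s) under current policy
        adv = q_values.gather(1, actions.unsqueeze(1)).squeeze(1) - v
        weights = torch.clamp(torch.exp(adv / beta), max=clip)  # exp-advantage weights
    log_pi = F.log_softmax(policy_logits, dim=-1)
    log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(weights * log_pi_a).mean()
```

Replacing the exponential weight with an indicator that the advantage is positive recovers the binary variant discussed in the CRR paper.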
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.