Fast Offline Policy Optimization for Large Scale Recommendation
- URL: http://arxiv.org/abs/2208.05327v4
- Date: Sat, 27 May 2023 07:44:41 GMT
- Title: Fast Offline Policy Optimization for Large Scale Recommendation
- Authors: Otmane Sakhi, David Rohde, Alexandre Gilotte
- Abstract summary: We derive an approximation of these policy learning algorithms that scales logarithmically with the catalogue size.
Our contribution is based upon combining three novel ideas.
Our estimator is an order of magnitude faster than naive approaches yet produces equally good policies.
- Score: 74.78213147859236
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Personalised interactive systems such as recommender systems require
selecting relevant items from massive catalogues dependent on context.
Reward-driven offline optimisation of these systems can be achieved by a
relaxation of the discrete problem, resulting in policy learning or
REINFORCE-style learning algorithms. Unfortunately, this relaxation step
requires computing a sum over the entire catalogue, making the complexity of
evaluating the gradient (and hence of each stochastic gradient descent
iteration) linear in the catalogue size. This calculation is untenable in many
real-world settings, such as large-catalogue recommender systems, severely
limiting the usefulness of this method in practice. In this paper, we derive an
approximation of these policy learning algorithms that scales logarithmically
with the catalogue size. Our contribution is based upon combining three novel
ideas: a new Monte Carlo estimate of the gradient of a policy, the
self-normalised importance sampling estimator and the use of fast maximum inner
product search at training time. Extensive experiments show that our algorithm
is an order of magnitude faster than naive approaches yet produces equally good
policies.
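
To make the computational issue concrete, the sketch below contrasts the exact gradient of log pi(a|x) for a dot-product softmax policy over item embeddings, which costs O(catalogue size) per example, with a self-normalised importance-sampling estimate built from a small number of sampled items. This is an illustrative NumPy sketch under assumed names (exact_grad_log_policy, snis_grad_log_policy, toy sizes), not the authors' implementation; in particular it uses a uniform proposal, whereas the paper draws samples with the help of a fast maximum inner product search index.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only).
catalogue_size, dim, n_samples = 100_000, 32, 256

item_embeddings = rng.normal(size=(catalogue_size, dim))  # psi_a for every item
user_embedding = rng.normal(size=dim)                      # f_theta(x) for one context
logged_action = 123                                        # item shown in the logs

def exact_grad_log_policy(u, psi, a):
    """Exact gradient of log pi(a|x) w.r.t. the user embedding for a softmax
    policy pi(a|x) proportional to exp(u . psi_a): requires a sum over the
    full catalogue, i.e. O(catalogue_size) per example."""
    scores = psi @ u
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return psi[a] - probs @ psi

def snis_grad_log_policy(u, psi, a, n_samples, rng):
    """Self-normalised importance-sampling approximation of the same gradient.
    The proposal here is uniform over the catalogue; in the paper the proposal
    would instead come from a fast maximum inner product search index."""
    idx = rng.integers(0, psi.shape[0], size=n_samples)    # draws from proposal q
    scores = psi[idx] @ u
    # w_i is proportional to pi(a_i|x) / q(a_i); with a uniform q the q term is
    # a constant and cancels after self-normalisation.
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return psi[a] - w @ psi[idx]

g_exact = exact_grad_log_policy(user_embedding, item_embeddings, logged_action)
g_approx = snis_grad_log_policy(user_embedding, item_embeddings, logged_action, n_samples, rng)
print("cosine similarity:",
      g_exact @ g_approx / (np.linalg.norm(g_exact) * np.linalg.norm(g_approx)))
```

With a uniform proposal the importance weights reduce to a softmax over the sampled scores, which keeps the example short; a proposal concentrated on high-scoring items, as retrieved by maximum inner product search, would generally lower the variance of the estimate.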
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, (hyper)policies are learned to convergence only to then deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Optimizing Solution-Samplers for Combinatorial Problems: The Landscape
of Policy-Gradient Methods [52.0617030129699]
We introduce a novel theoretical framework for analyzing the effectiveness of DeepMatching Networks and Reinforcement Learning methods.
Our main contribution holds for a broad class of problems including Max- and Min-Cut, Max-$k$-Bipartite-Bi, Maximum-Weight-Bipartite-Bi, and the Traveling Salesman Problem.
As a byproduct of our analysis we introduce a novel regularization process over vanilla descent and provide theoretical and experimental evidence that it helps address vanishing-gradient issues and escape bad stationary points.
arXiv Detail & Related papers (2023-10-08T23:39:38Z) - Fast Slate Policy Optimization: Going Beyond Plackett-Luce [7.366405857677226]
This paper addresses the optimization of large scale decision systems given an arbitrary reward function.
We propose a new class of policies, born from a novel relaxation of decision functions.
This results in a simple, yet efficient learning algorithm that scales to massive action spaces.
arXiv Detail & Related papers (2023-08-03T07:13:27Z) - Constrained Reinforcement Learning via Dissipative Saddle Flow Dynamics [5.270497591225775]
In constrained reinforcement learning (C-RL), an agent seeks to learn from the environment a policy that maximizes the expected cumulative reward.
Several algorithms rooted in sampling-based primal-dual methods have been recently proposed to solve this problem in policy space.
We propose a novel algorithm for constrained RL that does not suffer from these limitations.
arXiv Detail & Related papers (2022-12-03T01:54:55Z) - Learning-to-Rank at the Speed of Sampling: Plackett-Luce Gradient
Estimation With Minimal Computational Complexity [13.579420996461439]
We introduce the novel PL-Rank-3 algorithm that performs unbiased gradient estimation with a computational complexity comparable to the best sorting algorithms.
Our experimental results indicate large gains in the time required for optimization, without any loss in performance (a generic Plackett-Luce sampling sketch is given after this list).
arXiv Detail & Related papers (2022-04-22T18:01:33Z) - Contextual Exploration Using a Linear Approximation Method Based on
Satisficing [0.0]
The amount of exploration required for learning is often quite large.
Deep reinforcement learning reaches its super-human performance only through amounts of exploration that no human being could realistically carry out.
We propose Linear RS (LinRS), a type of satisficing algorithm and a linear extension of risk-sensitive satisficing (RS).
arXiv Detail & Related papers (2021-12-13T07:14:01Z) - Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement learning based ZO algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that our ZO-RL algorithm can effectively reduce the variance of the ZO gradient by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z) - Learning the Step-size Policy for the Limited-Memory
Broyden-Fletcher-Goldfarb-Shanno Algorithm [3.7470451129384825]
We consider the problem of how to learn a step-size policy for the Limited-Memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm.
We propose a neural network architecture with local information of the current gradient as the input.
The step-length policy is learned from data of similar optimization problems, avoids additional evaluations of the objective function, and guarantees that the output step remains inside a pre-defined interval.
arXiv Detail & Related papers (2020-10-03T09:34:03Z) - Optimization of Graph Total Variation via Active-Set-based Combinatorial
Reconditioning [48.42916680063503]
We propose a novel adaptive preconditioning strategy for proximal algorithms on this problem class.
We show that nested-forest decomposition of the inactive edges yields a guaranteed local linear convergence rate.
Our results suggest that local convergence analysis can serve as a guideline for selecting variable metrics in proximal algorithms.
arXiv Detail & Related papers (2020-02-27T16:33:09Z)
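
Sampling-based Plackett-Luce estimators such as the one referenced in the PL-Rank-3 entry above need to draw rankings cheaply. The following minimal NumPy sketch shows the standard Gumbel top-k trick for drawing top-k prefixes of Plackett-Luce rankings at roughly sorting cost; it is a generic illustration (the function name sample_plackett_luce_topk and the toy sizes are assumptions), not the PL-Rank-3 estimator itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_plackett_luce_topk(scores, n_rankings, k, rng):
    """Draw top-k prefixes of Plackett-Luce rankings with the Gumbel top-k trick:
    adding i.i.d. Gumbel(0, 1) noise to the log-scores and sorting in descending
    order is equivalent to sampling items sequentially, without replacement,
    from the softmax over the remaining items."""
    gumbel = rng.gumbel(size=(n_rankings, scores.shape[0]))
    noisy = scores[None, :] + gumbel
    # Partial sort: select the k best items per ranking, then order just those k,
    # keeping the per-ranking cost close to that of a good sorting routine.
    top = np.argpartition(-noisy, k, axis=1)[:, :k]
    order = np.argsort(-np.take_along_axis(noisy, top, axis=1), axis=1)
    return np.take_along_axis(top, order, axis=1)

scores = rng.normal(size=1_000)          # log-scores (logits) for 1,000 items
slates = sample_plackett_luce_topk(scores, n_rankings=32, k=10, rng=rng)
print(slates.shape)                      # (32, 10): 32 sampled top-10 slates
```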