Local Policy Improvement for Recommender Systems
- URL: http://arxiv.org/abs/2212.11431v2
- Date: Wed, 26 Apr 2023 22:49:15 GMT
- Title: Local Policy Improvement for Recommender Systems
- Authors: Dawen Liang, Nikos Vlassis
- Abstract summary: We show how to train a new policy given data collected from a previously-deployed policy.
We suggest an alternative approach of local policy improvement without off-policy correction.
This local policy improvement paradigm is ideal for recommender systems, as previous policies are typically of decent quality and policies are updated frequently.
- Score: 8.617221361305901
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recommender systems predict what items a user will interact with next, based
on their past interactions. The problem is often approached through supervised
learning, but recent advancements have shifted towards policy optimization of
rewards (e.g., user engagement). One challenge with the latter is policy
mismatch: we are only able to train a new policy given data collected from a
previously-deployed policy. The conventional way to address this problem is
through importance sampling correction, but this comes with practical
limitations. We suggest an alternative approach of local policy improvement
without off-policy correction. Our method computes and optimizes a lower bound
of expected reward of the target policy, which is easy to estimate from data
and does not involve density ratios (such as those appearing in importance
sampling correction). This local policy improvement paradigm is ideal for
recommender systems, as previous policies are typically of decent quality and
policies are updated frequently. We provide empirical evidence and practical
recipes for applying our technique in a sequential recommendation setting.
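To make the "lower bound without density ratios" idea concrete, below is a minimal sketch of a reward-weighted log-likelihood surrogate on logged data, in the spirit of the abstract rather than the paper's exact bound: every term is an expectation under the previously-deployed logging policy, so no importance weights (ratios of new to old policy probabilities) ever appear. The model, names (SoftmaxRecPolicy, local_improvement_loss), and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxRecPolicy(nn.Module):
    """Tiny sequential recommender: embeds the last-interacted item and scores all items."""
    def __init__(self, num_items: int, dim: int = 32):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)
        self.out = nn.Linear(dim, num_items)

    def forward(self, context_items):            # (batch,) ids of the last-clicked item
        h = self.item_emb(context_items)          # (batch, dim)
        return F.log_softmax(self.out(h), dim=-1) # (batch, num_items) log-probabilities

def local_improvement_loss(log_probs, logged_actions, rewards):
    """Reward-weighted log-likelihood on logged data. For non-negative rewards this is a
    standard lower-bound-style surrogate for expected reward (EM / reward-weighted
    regression flavor); no density ratios are needed because the expectation is taken
    under the logging policy. Illustrative surrogate, not the paper's exact bound."""
    logp_a = log_probs.gather(1, logged_actions.unsqueeze(1)).squeeze(1)
    return -(rewards * logp_a).mean()

# Toy logged batch from the previously-deployed policy: (context, action, reward).
num_items = 1000
policy = SoftmaxRecPolicy(num_items)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

context = torch.randint(0, num_items, (256,))
actions = torch.randint(0, num_items, (256,))
rewards = torch.randint(0, 2, (256,)).float()   # e.g., click / no-click engagement

loss = local_improvement_loss(policy(context), actions, rewards)
opt.zero_grad()
loss.backward()
opt.step()
```

Contrast with importance sampling correction: no logged propensities or ratio clipping are required, and the objective simply pushes the new policy toward high-reward actions already present in the log, which matches the local-improvement setting where the previous policy is of decent quality and is refreshed often.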
Related papers
- Forward KL Regularized Preference Optimization for Aligning Diffusion Policies [8.958830452149789]
A central problem for learning diffusion policies is to align the policy output with human intents in various tasks.
We propose a novel framework, Forward KL regularized Preference optimization, to align the diffusion policy with preferences directly.
The results show our method exhibits superior alignment with preferences and outperforms previous state-of-the-art algorithms.
arXiv Detail & Related papers (2024-09-09T13:56:03Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
Policy Convolution family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- IOB: Integrating Optimization Transfer and Behavior Transfer for Multi-Policy Reuse [50.90781542323258]
Reinforcement learning (RL) agents can transfer knowledge from source policies to a related target task.
Previous methods introduce additional components, such as hierarchical policies or estimations of source policies' value functions.
We propose a novel transfer RL method that selects the source policy without training extra components.
arXiv Detail & Related papers (2023-08-14T09:22:35Z)
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- CUP: Critic-Guided Policy Reuse [37.12379523150601]
Critic-gUided Policy reuse (CUP) is a policy reuse algorithm that avoids training any extra components and efficiently reuses source policies.
CUP selects the source policy that has the largest one-step improvement over the current target policy, and forms a guidance policy.
Empirical results demonstrate that CUP achieves efficient transfer and significantly outperforms baseline algorithms.
arXiv Detail & Related papers (2022-10-15T00:53:03Z)
- Memory-Constrained Policy Optimization [59.63021433336966]
We introduce a new constrained optimization method for policy gradient reinforcement learning.
We form a second trust region through the construction of another virtual policy that represents a wide range of past policies.
We then enforce the new policy to stay closer to the virtual policy, which is beneficial in case the old policy performs badly.
arXiv Detail & Related papers (2022-04-20T08:50:23Z)
- An Alternate Policy Gradient Estimator for Softmax Policies [36.48028448548086]
We propose a novel policy gradient estimator for softmax policies.
Our analysis and experiments, conducted on bandits and classical MDP benchmarking tasks, show that our estimator is more robust to policy saturation.
arXiv Detail & Related papers (2021-12-22T02:01:19Z)
- Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
arXiv Detail & Related papers (2021-07-03T07:01:23Z)
- Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
arXiv Detail & Related papers (2020-12-30T03:22:35Z)
- First Order Constrained Optimization in Policy Space [19.00289722198614]
We propose a novel approach called First Order Constrained Optimization in Policy Space (FOCOPS).
FOCOPS maximizes an agent's overall reward while ensuring the agent satisfies a set of cost constraints.
We provide empirical evidence that our simple approach achieves better performance on a set of constrained robotic locomotion tasks.
arXiv Detail & Related papers (2020-02-16T05:07:17Z)