Local Policy Improvement for Recommender Systems
- URL: http://arxiv.org/abs/2212.11431v2
- Date: Wed, 26 Apr 2023 22:49:15 GMT
- Title: Local Policy Improvement for Recommender Systems
- Authors: Dawen Liang, Nikos Vlassis
- Abstract summary: We show how to train a new policy given data collected from a previously-deployed policy.
We suggest an alternative approach of local policy improvement without off-policy correction.
This local policy improvement paradigm is ideal for recommender systems, as previous policies are typically of decent quality and policies are updated frequently.
- Score: 8.617221361305901
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recommender systems predict what items a user will interact with next, based
on their past interactions. The problem is often approached through supervised
learning, but recent advancements have shifted towards policy optimization of
rewards (e.g., user engagement). One challenge with the latter is policy
mismatch: we are only able to train a new policy given data collected from a
previously-deployed policy. The conventional way to address this problem is
through importance sampling correction, but this comes with practical
limitations. We suggest an alternative approach of local policy improvement
without off-policy correction. Our method computes and optimizes a lower bound
of expected reward of the target policy, which is easy to estimate from data
and does not involve density ratios (such as those appearing in importance
sampling correction). This local policy improvement paradigm is ideal for
recommender systems, as previous policies are typically of decent quality and
policies are updated frequently. We provide empirical evidence and practical
recipes for applying our technique in a sequential recommendation setting.
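To make the "lower bound without density ratios" idea concrete, below is a minimal sketch of a reward-weighted log-likelihood surrogate on logged data, in the spirit of the abstract rather than the paper's exact bound: every term is an expectation under the previously-deployed logging policy, so no importance weights (ratios of new to old policy probabilities) ever appear. The model, names (SoftmaxRecPolicy, local_improvement_loss), and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxRecPolicy(nn.Module):
    """Tiny sequential recommender: embeds the last-interacted item and scores all items."""
    def __init__(self, num_items: int, dim: int = 32):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)
        self.out = nn.Linear(dim, num_items)

    def forward(self, context_items):            # (batch,) ids of the last-clicked item
        h = self.item_emb(context_items)          # (batch, dim)
        return F.log_softmax(self.out(h), dim=-1) # (batch, num_items) log-probabilities

def local_improvement_loss(log_probs, logged_actions, rewards):
    """Reward-weighted log-likelihood on logged data. For non-negative rewards this is a
    standard lower-bound-style surrogate for expected reward (EM / reward-weighted
    regression flavor); no density ratios are needed because the expectation is taken
    under the logging policy. Illustrative surrogate, not the paper's exact bound."""
    logp_a = log_probs.gather(1, logged_actions.unsqueeze(1)).squeeze(1)
    return -(rewards * logp_a).mean()

# Toy logged batch from the previously-deployed policy: (context, action, reward).
num_items = 1000
policy = SoftmaxRecPolicy(num_items)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

context = torch.randint(0, num_items, (256,))
actions = torch.randint(0, num_items, (256,))
rewards = torch.randint(0, 2, (256,)).float()   # e.g., click / no-click engagement

loss = local_improvement_loss(policy(context), actions, rewards)
opt.zero_grad()
loss.backward()
opt.step()
```

Contrast with importance sampling correction: no logged propensities or ratio clipping are required, and the objective simply pushes the new policy toward high-reward actions already present in the log, which matches the local-improvement setting where the previous policy is of decent quality and is refreshed often.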
Related papers
- Forward KL Regularized Preference Optimization for Aligning Diffusion Policies [8.958830452149789]
A central problem for learning diffusion policies is to align the policy output with human intents in various tasks.
We propose a novel framework, Forward KL regularized Preference optimization, to align the diffusion policy with preferences directly.
The results show our method exhibits superior alignment with preferences and outperforms previous state-of-the-art algorithms.
arXiv Detail & Related papers (2024-09-09T13:56:03Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
Policy Convolution family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- IOB: Integrating Optimization Transfer and Behavior Transfer for Multi-Policy Reuse [50.90781542323258]
Reinforcement learning (RL) agents can transfer knowledge from source policies to a related target task.
Previous methods introduce additional components, such as hierarchical policies or estimations of source policies' value functions.
We propose a novel transfer RL method that selects the source policy without training extra components.
arXiv Detail & Related papers (2023-08-14T09:22:35Z)
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- CUP: Critic-Guided Policy Reuse [37.12379523150601]
Critic-gUided Policy reuse (CUP) is a policy reuse algorithm that avoids training any extra components and efficiently reuses source policies.
CUP selects the source policy that has the largest one-step improvement over the current target policy, and forms a guidance policy.
Empirical results demonstrate that CUP achieves efficient transfer and significantly outperforms baseline algorithms.
arXiv Detail & Related papers (2022-10-15T00:53:03Z)
- Memory-Constrained Policy Optimization [59.63021433336966]
We introduce a new constrained optimization method for policy gradient reinforcement learning.
We form a second trust region through the construction of another virtual policy that represents a wide range of past policies.
We then enforce the new policy to stay closer to the virtual policy, which is beneficial in case the old policy performs badly.
arXiv Detail & Related papers (2022-04-20T08:50:23Z)
- An Alternate Policy Gradient Estimator for Softmax Policies [36.48028448548086]
We propose a novel policy gradient estimator for softmax policies.
Our analysis and experiments, conducted on bandits and classical MDP benchmarking tasks, show that our estimator is more robust to policy saturation.
arXiv Detail & Related papers (2021-12-22T02:01:19Z)
- Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
arXiv Detail & Related papers (2021-07-03T07:01:23Z)
- Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
arXiv Detail & Related papers (2020-12-30T03:22:35Z)
- First Order Constrained Optimization in Policy Space [19.00289722198614]
We propose a novel approach called First Order Constrained Optimization in Policy Space (FOCOPS).
FOCOPS maximizes an agent's overall reward while ensuring the agent satisfies a set of cost constraints.
We provide empirical evidence that our simple approach achieves better performance on a set of constrained robotic locomotion tasks.
arXiv Detail & Related papers (2020-02-16T05:07:17Z)