Unified Off-Policy Learning to Rank: a Reinforcement Learning
Perspective
- URL: http://arxiv.org/abs/2306.07528v3
- Date: Sat, 28 Oct 2023 06:12:37 GMT
- Title: Unified Off-Policy Learning to Rank: a Reinforcement Learning
Perspective
- Authors: Zeyu Zhang, Yi Su, Hui Yuan, Yiran Wu, Rishab Balasubramanian, Qingyun
Wu, Huazheng Wang, Mengdi Wang
- Abstract summary: Off-policy learning to rank methods often make strong assumptions about how users generate the click data.
We show that offline reinforcement learning can adapt to various click models without complex debiasing techniques or prior knowledge of the model.
Results on various large-scale datasets demonstrate that CUOLR consistently outperforms the state-of-the-art off-policy learning to rank algorithms.
- Score: 61.4025671743675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy Learning to Rank (LTR) aims to optimize a ranker from data
collected by a deployed logging policy. However, existing off-policy learning
to rank methods often make strong assumptions about how users generate the
click data, i.e., the click model, and hence need to tailor their methods
specifically under different click models. In this paper, we unified the
ranking process under general stochastic click models as a Markov Decision
Process (MDP), and the optimal ranking could be learned with offline
reinforcement learning (RL) directly. Building upon this, we leverage offline
RL techniques for off-policy LTR and propose the Click Model-Agnostic Unified
Off-policy Learning to Rank (CUOLR) method, which could be easily applied to a
wide range of click models. Through a dedicated formulation of the MDP, we show
that offline RL algorithms can adapt to various click models without complex
debiasing techniques or prior knowledge of the model. Results on various
large-scale datasets demonstrate that CUOLR consistently outperforms the
state-of-the-art off-policy learning to rank algorithms while maintaining
consistency and robustness under different click models.
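To make the MDP view above concrete, here is a minimal Python sketch that converts a logged ranking session into MDP transitions and fits a linear Q-function offline. It is illustrative only: the state encoding (query features, current position, and a running mean of already placed documents), the click-as-reward choice, and the linear fitted Q-iteration learner are assumptions made for exposition, not the CUOLR implementation described in the paper.

```python
# Illustrative sketch: ranking as an MDP learned from logged clicks.
# The state encoding, reward choice, and linear Q-function are assumptions
# for exposition, not the paper's CUOLR architecture.
import numpy as np

def session_to_transitions(query_feats, doc_feats, shown_doc_ids, clicks):
    """Turn one logged ranking session into MDP transitions.

    State  : [query features, position, mean features of documents placed so far]
    Action : feature vector of the document placed at this position
    Reward : 1.0 if the document was clicked, else 0.0
    """
    transitions, placed = [], np.zeros(doc_feats.shape[1])
    for pos, (d, c) in enumerate(zip(shown_doc_ids, clicks)):
        state = np.concatenate([query_feats, [pos], placed])
        action = doc_feats[d]
        placed = (placed * pos + action) / (pos + 1)   # running mean of placed docs
        next_state = np.concatenate([query_feats, [pos + 1], placed])
        done = pos == len(shown_doc_ids) - 1
        transitions.append((state, action, float(c), next_state, done))
    return transitions

def fitted_q_iteration(transitions, next_candidates, n_iters=50, gamma=0.99, lam=1e-3):
    """Batch (offline) Q-learning with a linear Q(s, a) = w . [s, a].

    `next_candidates[i]` holds the feature vectors of documents still available
    after transition i; the Bellman target maximizes over them.
    """
    X = np.stack([np.concatenate([s, a]) for s, a, _, _, _ in transitions])
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        targets = []
        for (s, a, r, s2, done), cands in zip(transitions, next_candidates):
            if done or len(cands) == 0:
                targets.append(r)
            else:
                q_next = max(np.dot(w, np.concatenate([s2, a2])) for a2 in cands)
                targets.append(r + gamma * q_next)
        # Ridge regression toward the Bellman targets (one FQI sweep).
        w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ np.array(targets))
    return w
```

At serving time, a ranking for a new query would then be built greedily, placing at each position the candidate document with the highest predicted Q-value.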
Related papers
- Fine-Tuning Language Models with Reward Learning on Policy [68.70065254564642]
Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences.
Despite its popularity, (fixed) reward models may become inaccurate off-distribution, since policy optimization shifts the data distribution.
We propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution.
arXiv Detail & Related papers (2024-03-28T10:02:10Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks from the D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- A Unified Framework for Alternating Offline Model Training and Policy Learning [62.19209005400561]
In offline model-based reinforcement learning, we learn a dynamics model from historically collected data and utilize the learned model and fixed datasets for policy learning.
We develop an iterative offline MBRL framework, where we maximize a lower bound of the true expected return.
With the proposed unified model-policy learning framework, we achieve competitive performance on a wide range of continuous-control offline reinforcement learning datasets.
arXiv Detail & Related papers (2022-10-12T04:58:51Z)
- Meta-Reinforcement Learning for Adaptive Control of Second Order Systems [3.131740922192114]
In process control, many systems have similar and well-understood dynamics, which suggests it is feasible to create a generalizable controller through meta-learning.
We formulate a meta reinforcement learning (meta-RL) control strategy that takes advantage of known, offline information for training, such as a model structure.
A key design element is the ability to leverage model-based information offline during training, while maintaining a model-free policy structure for interacting with new environments.
arXiv Detail & Related papers (2022-09-19T18:51:33Z)
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
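As a rough, hedged illustration of the two-policy idea summarized above, the sketch below rolls in with a pre-existing guide policy for the first few steps of each episode and then hands control to the learning policy, gradually shrinking the guide horizon. The environment interface, the agent's act/update methods, and the curriculum rule are assumptions for illustration, not the exact JSRL procedure.

```python
# Hedged sketch of a jump-start-style rollout: a guide policy acts for the
# first `guide_steps` steps, then the learning policy takes over. The gym-style
# env API and the simple curriculum below are illustrative assumptions.
def jump_start_rollout(env, guide_policy, explore_policy, guide_steps):
    obs, transitions, done, t = env.reset(), [], False, 0
    while not done:
        policy = guide_policy if t < guide_steps else explore_policy
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        obs, t = next_obs, t + 1
    return transitions

def train_with_jump_start(env, guide_policy, agent, horizon, n_iters, evaluate, threshold):
    guide_steps = horizon
    for _ in range(n_iters):
        batch = jump_start_rollout(env, guide_policy, agent.act, guide_steps)
        agent.update(batch)                    # any off-policy RL update
        if evaluate(agent) >= threshold:       # performance gate: shrink roll-in
            guide_steps = max(0, guide_steps - 1)
    return agent
```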
- Meta Reinforcement Learning for Adaptive Control: An Offline Approach [3.131740922192114]
We formulate a meta reinforcement learning (meta-RL) control strategy that takes advantage of known, offline information for training.
Our meta-RL agent has a recurrent structure that accumulates "context" for its current dynamics through a hidden state variable.
In tests reported here, the meta-RL agent was trained entirely offline, yet produced excellent results in novel settings.
arXiv Detail & Related papers (2022-03-17T23:58:52Z)
- Online and Offline Reinforcement Learning by Planning with a Learned Model [15.8026041700727]
We describe the Reanalyse algorithm which uses model-based policy and value improvement operators to compute new improved training targets on existing data points.
We show that Reanalyse can also be used to learn entirely from demonstrations without any environment interactions.
We introduce MuZero Unplugged, a single unified algorithm for any data budget, including offline RL.
arXiv Detail & Related papers (2021-04-13T15:36:06Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
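The uncertainty-penalized reward mentioned in the MOPO summary can be sketched in a few lines. The ensemble-disagreement uncertainty proxy below is an illustrative choice, not necessarily the exact penalty used in that paper.

```python
# Hedged sketch: penalize imagined rewards by the dynamics model's uncertainty.
# Each ensemble member is assumed to map (state, action) -> (next_state, reward);
# ensemble disagreement serves as an illustrative uncertainty proxy.
import numpy as np

def penalized_model_step(ensemble, state, action, penalty_coef=1.0):
    preds = [model(state, action) for model in ensemble]
    next_states = np.stack([p[0] for p in preds])
    rewards = np.array([p[1] for p in preds])
    uncertainty = np.linalg.norm(next_states.std(axis=0))  # spread of predictions
    k = np.random.randint(len(ensemble))                    # sample one member
    return next_states[k], rewards.mean() - penalty_coef * uncertainty
```

Synthetic rollouts generated this way can then be mixed with the logged data for standard off-policy training.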