Non-Stationary Off-Policy Optimization
- URL: http://arxiv.org/abs/2006.08236v3
- Date: Sun, 4 Apr 2021 06:44:08 GMT
- Title: Non-Stationary Off-Policy Optimization
- Authors: Joey Hong and Branislav Kveton and Manzil Zaheer and Yinlam Chow and Amr Ahmed
- Abstract summary: We study the novel problem of off-policy optimization in piecewise-stationary contextual bandits.
In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state.
In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance.
- Score: 50.41335279896062
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy learning is a framework for evaluating and optimizing policies
without deploying them, from data collected by another policy. Real-world
environments are typically non-stationary and the offline learned policies
should adapt to these changes. To address this challenge, we study the novel
problem of off-policy optimization in piecewise-stationary contextual bandits.
Our proposed solution has two phases. In the offline learning phase, we
partition logged data into categorical latent states and learn a near-optimal
sub-policy for each state. In the online deployment phase, we adaptively switch
between the learned sub-policies based on their performance. This approach is
practical and analyzable, and we provide guarantees on both the quality of
off-policy optimization and the regret during online deployment. To show the
effectiveness of our approach, we compare it to state-of-the-art baselines on
both synthetic and real-world datasets. Our approach outperforms methods that
act only on observed context.
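The two-phase procedure in the abstract can be read as: (i) partition the logged rounds into categorical latent states and fit one sub-policy per state by maximizing an off-policy value estimate, then (ii) run an adaptive switching rule over the learned sub-policies online. The sketch below is one illustrative reading of that recipe, not the authors' implementation: the latent partition is approximated with plain k-means, each sub-policy is a linear softmax model trained on an inverse-propensity-scored (IPS) objective, the online switch is a simple EXP3-style rule, and `get_context` / `get_reward` are hypothetical environment hooks.

```python
# Illustrative sketch only; all function names, the k-means partition, the IPS
# objective, and the EXP3-style switching rule are assumptions, not the paper's code.
import numpy as np


def partition_logged_data(contexts, n_states, n_iters=20, seed=0):
    """Assign each logged round to one of `n_states` latent states (plain k-means
    on contexts; the paper's latent-state model may differ)."""
    rng = np.random.default_rng(seed)
    centers = contexts[rng.choice(len(contexts), n_states, replace=False)]
    for _ in range(n_iters):
        assign = np.argmin(((contexts[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for s in range(n_states):
            if np.any(assign == s):
                centers[s] = contexts[assign == s].mean(axis=0)
    return assign


def learn_sub_policy(contexts, actions, rewards, propensities, n_actions, lr=0.1, epochs=200):
    """Fit a linear softmax sub-policy by gradient ascent on the IPS estimate of its
    value, using one latent state's logged (context, action, reward, propensity) data."""
    n, d = contexts.shape
    W = np.zeros((n_actions, d))
    for _ in range(epochs):
        logits = contexts @ W.T
        logits -= logits.max(axis=1, keepdims=True)
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
        w_ips = rewards / propensities                     # importance weights r / mu
        grad = np.zeros_like(W)
        for a in range(n_actions):
            # d/dW_a of (1/n) sum_i (r_i/mu_i) * pi(a_i|x_i)
            coef = w_ips * pi[np.arange(n), actions] * ((actions == a) - pi[:, a])
            grad[a] = coef @ contexts / n
        W += lr * grad
    return W


def online_switching(sub_policies, get_context, get_reward, horizon, seed=0):
    """Adaptively switch between learned sub-policies online with a simple EXP3-style
    weighting over sub-policies (assumes rewards in [0, 1]; the paper's rule may differ)."""
    rng = np.random.default_rng(seed)
    k = len(sub_policies)
    weights, eta = np.ones(k), np.sqrt(np.log(k) / (k * horizon)) if k > 1 else (np.ones(k), 0.0)
    total = 0.0
    for t in range(horizon):
        p = weights / weights.sum()
        j = rng.choice(k, p=p)
        x = get_context(t)                                 # hypothetical environment hook
        a = int(np.argmax(x @ sub_policies[j].T))
        r = get_reward(t, a)                               # hypothetical environment hook
        weights[j] *= np.exp(eta * r / (k * p[j]))         # importance-weighted update
        total += r
    return total
```

A full pipeline under these assumptions would call `partition_logged_data`, fit one weight matrix per latent state with `learn_sub_policy` on that state's rounds, and pass the resulting list of matrices to `online_switching`.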
Related papers
- Train Once, Get a Family: State-Adaptive Balances for Offline-to-Online Reinforcement Learning [71.02384943570372]
Family Offline-to-Online RL (FamO2O) is a framework that empowers existing algorithms to determine state-adaptive improvement-constraint balances.
FamO2O offers a statistically significant improvement over various existing methods, achieving state-of-the-art performance on the D4RL benchmark.
arXiv Detail & Related papers (2023-10-27T08:30:54Z)
- Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning [9.341618348621662]
We aim to find the best-performing policy within a limited budget of online interactions.
We first study the major online RL exploration methods based on intrinsic rewards and UCB.
We then introduce an algorithm for planning to go out-of-distribution that avoids these issues.
arXiv Detail & Related papers (2023-10-09T13:47:05Z)
- PAnDR: Fast Adaptation to New Environments from Offline Experiences via Decoupling Policy and Environment Representations [39.11141327059819]
We propose Policy Adaptation with Decoupled Representations (PAnDR) for fast policy adaptation.
In the offline training phase, the environment representation and policy representation are learned through contrastive learning and policy recovery.
In the online adaptation phase, the environment context is inferred from a few experiences collected in new environments, and the policy is then optimized by gradient ascent.
arXiv Detail & Related papers (2022-04-06T14:47:35Z)
- Latent-Variable Advantage-Weighted Policy Optimization for Offline RL [70.01851346635637]
Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions.
In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios.
We propose to leverage latent-variable policies that can represent a broader class of policy distributions.
Our method improves the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets.
arXiv Detail & Related papers (2022-03-16T21:17:03Z)
- A Regularized Implicit Policy for Offline Reinforcement Learning [54.7427227775581]
Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment.
We propose a framework that supports learning a flexible yet well-regularized fully-implicit policy.
Experiments and ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.
arXiv Detail & Related papers (2022-02-19T20:22:04Z)
- Offline Neural Contextual Bandits: Pessimism, Optimization and Generalization [42.865641215856925]
We propose a provably efficient offline contextual bandit with neural network function approximation.
We show that our method generalizes over unseen contexts under a milder condition for distributional shift than the existing OPL works.
We also demonstrate the empirical effectiveness of our method in a range of synthetic and real-world OPL problems.
arXiv Detail & Related papers (2021-11-27T03:57:13Z)
- Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy; a minimal sketch of a standard OPE estimator appears after this list.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
arXiv Detail & Related papers (2021-07-03T07:01:23Z)
- Off-policy Learning for Remote Electrical Tilt Optimization [68.8204255655161]
We address the problem of Remote Electrical Tilt (RET) optimization using off-policy Contextual Multi-Armed-Bandit (CMAB) techniques.
We propose CMAB learning algorithms to extract optimal tilt update policies from the data.
Our policies show consistent improvements over the rule-based logging policy used to collect the data.
arXiv Detail & Related papers (2020-05-21T11:30:31Z)
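Several of the related papers above, like the main paper, rely on off-policy evaluation from logged bandit data (e.g., the Supervised Off-Policy Ranking and Remote Electrical Tilt entries). The sketch below shows the standard inverse propensity scoring (IPS) estimator of a target policy's value; it is a generic textbook construction with hypothetical names, not code from any of the listed papers.

```python
# Generic IPS estimator for off-policy evaluation; `target_probs` and the logged
# fields are hypothetical names. Assumes the logging policy's action propensities
# were recorded and are bounded away from zero.
import numpy as np


def ips_value(contexts, actions, rewards, propensities, target_probs, clip=None):
    """Estimate a target policy's value from data logged by another policy.

    contexts:     (n, d) logged contexts
    actions:      (n,)   actions chosen by the logging policy
    rewards:      (n,)   observed rewards
    propensities: (n,)   logging-policy probabilities of the logged actions
    target_probs: callable(contexts, actions) -> (n,) target-policy probabilities
    clip:         optional cap on importance weights to reduce variance
    """
    w = target_probs(contexts, actions) / propensities
    if clip is not None:
        w = np.minimum(w, clip)
    return float(np.mean(w * rewards))


if __name__ == "__main__":
    # Toy usage: evaluate a uniform-random target policy over 3 actions.
    rng = np.random.default_rng(0)
    n, d, k = 1000, 5, 3
    X = rng.normal(size=(n, d))
    A = rng.integers(0, k, size=n)
    R = rng.binomial(1, 0.5, size=n).astype(float)
    mu = np.full(n, 1.0 / k)          # uniform logging policy propensities
    print(ips_value(X, A, R, mu, lambda x, a: np.full(len(a), 1.0 / k)))
```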