Oracle-Efficient Pessimism: Offline Policy Optimization in Contextual Bandits
- URL: http://arxiv.org/abs/2306.07923v2
- Date: Wed, 25 Oct 2023 23:57:55 GMT
- Title: Oracle-Efficient Pessimism: Offline Policy Optimization in Contextual Bandits
- Authors: Lequn Wang, Akshay Krishnamurthy, Aleksandrs Slivkins
- Abstract summary: We present the first general oracle-efficient algorithm for pessimistic OPO.
We obtain statistical guarantees analogous to those for prior pessimistic approaches.
We show an advantage over unregularized OPO across a wide range of configurations.
- Score: 82.28442917447643
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider offline policy optimization (OPO) in contextual bandits, where
one is given a fixed dataset of logged interactions. While pessimistic
regularizers are typically used to mitigate distribution shift, prior
implementations thereof are either specialized or computationally inefficient.
We present the first general oracle-efficient algorithm for pessimistic OPO: it
reduces to supervised learning, leading to broad applicability. We obtain
statistical guarantees analogous to those for prior pessimistic approaches. We
instantiate our approach for both discrete and continuous actions and perform
experiments in both settings, showing advantage over unregularized OPO across a
wide range of configurations.
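As a rough, generic illustration of the pessimism principle in OPO (a minimal sketch, not the paper's oracle-efficient algorithm; all function and variable names are hypothetical): score each candidate policy by an importance-weighted value estimate minus an uncertainty penalty, and return the policy with the best lower bound.
```python
import numpy as np

def pessimistic_opo(policies, contexts, actions, rewards, propensities, alpha=1.0):
    """Pick the policy with the highest lower bound: an IPS value estimate minus
    a variance-style uncertainty penalty (generic pessimism, not the paper's method)."""
    n = len(rewards)
    best_policy, best_lcb = None, -np.inf
    for policy in policies:
        # policy(x) is assumed to return a probability vector over actions
        pi = np.array([policy(x)[a] for x, a in zip(contexts, actions)])
        w = pi / propensities                           # importance weights
        value = np.mean(w * rewards)                    # IPS estimate of the policy's value
        penalty = alpha * np.sqrt(np.mean(w ** 2) / n)  # penalize poorly covered policies
        lcb = value - penalty
        if lcb > best_lcb:
            best_policy, best_lcb = policy, lcb
    return best_policy, best_lcb
```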
Related papers
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
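A minimal sketch of this kind of combined objective, assuming a DPO-style preference loss plus an SFT log-likelihood regularizer (the exact weighting and parameterization in the paper may differ; tensor names are hypothetical):
```python
import torch
import torch.nn.functional as F

def combined_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
                  sft_logp, beta=0.1, lam=1.0):
    """DPO-style preference loss plus an SFT (log-likelihood) term.

    logp_* are summed token log-probabilities of the policy / reference model on the
    chosen and rejected responses; sft_logp is the policy's log-likelihood of the
    preferred (or demonstration) response, acting as the regularizer.
    """
    # Preference term: log-sigmoid of the scaled log-ratio margin.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    pref_loss = -F.logsigmoid(margin).mean()
    # SFT term: maximize likelihood of preferred responses.
    sft_loss = -sft_logp.mean()
    return pref_loss + lam * sft_loss
```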
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Surpassing legacy approaches to PWR core reload optimization with single-objective Reinforcement learning [0.0]
We have developed methods based on Deep Reinforcement Learning (DRL) for both single- and multi-objective optimization.
In this paper, we demonstrate the advantage of our RL-based approach, specifically using Proximal Policy Optimization (PPO)
PPO adapts its search capability via a policy with learnable weights, allowing it to function as both a global and local search method.
arXiv Detail & Related papers (2024-02-16T19:35:58Z)
- Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO, when derived from the optimal solution of this problem, yields a compromised mean-seeking approximation of the optimal solution in practice.
We propose efficient exact optimization (EXO) of the alignment objective.
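For context, this is the standard closed-form optimum of the KL-regularized alignment objective from which DPO is derived (generic notation, not quoted from the paper):
```latex
\[
  \pi^{*}(y \mid x)
  = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
    \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big),
  \qquad
  Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\,
         \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big).
\]
```
DPO and EXO can be viewed as different ways of approximating this optimal policy; the abstract's point is that DPO's approximation turns out to be mean-seeking in practice.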
arXiv Detail & Related papers (2024-02-01T18:51:54Z)
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensity of exploring every action must be bounded away from zero for all individual characteristics.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
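As a toy illustration of optimizing LCBs rather than point estimates (a generic sketch, not the PPL algorithm itself; array names are hypothetical):
```python
import numpy as np

def lcb_policy(reward_mean, reward_se, kappa=1.0):
    """Greedy decision rule w.r.t. lower confidence bounds on estimated rewards.

    reward_mean, reward_se: arrays of shape (n_contexts, n_actions) holding point
    estimates and standard errors of the reward for each (context, action) pair.
    """
    lcb = reward_mean - kappa * reward_se   # penalize uncertain actions
    return np.argmax(lcb, axis=1)           # highest-LCB action per context
```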
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
- Learning in Observable POMDPs, without Computationally Intractable Oracles [23.636033995089587]
We develop the first oracle-free learning algorithm for POMDPs under reasonable assumptions.
Specifically, we give a quasipolynomial-time end-to-end algorithm for learning in "observable" POMDPs, where observability is the assumption that well-separated distributions over states induce well-separated distributions over observations.
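One common formalization of this observability assumption, stated here for concreteness (standard in this line of work, not quoted from the abstract):
```latex
% gamma-observability:
\[
  \big\lVert \mathbb{O}^{\top} b_{1} - \mathbb{O}^{\top} b_{2} \big\rVert_{1}
  \;\ge\; \gamma \,\big\lVert b_{1} - b_{2} \big\rVert_{1}
  \qquad \text{for all distributions } b_{1}, b_{2} \text{ over states},
\]
% where O_{s,o} = Pr[observation o | state s], so well-separated state
% distributions induce well-separated observation distributions.
```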
arXiv Detail & Related papers (2022-06-07T17:05:27Z)
- Pessimistic Off-Policy Optimization for Learning to Rank [13.733459243449634]
Off-policy learning is a framework for optimizing policies without deploying them.
In recommender systems, this is especially challenging due to the imbalance in logged data.
We study pessimistic off-policy optimization for learning to rank.
arXiv Detail & Related papers (2022-06-06T12:58:28Z)
- You May Not Need Ratio Clipping in PPO [117.03368180633463]
Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data.
Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples.
We show in this paper that such ratio clipping may not be a good option as it can fail to effectively bound the ratios.
We also show that the proposed alternative, ESPO, can be easily scaled up to distributed training with many workers, delivering strong performance as well.
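For reference, the standard PPO clipped surrogate that the ratio-clipping discussion refers to (a generic sketch, not ESPO):
```python
import torch

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate objective (generic illustration).

    Note: clipping only flattens the gradient once the ratio leaves
    [1 - eps, 1 + eps]; it does not by itself keep the ratio bounded,
    which is the failure mode the paper points out.
    """
    ratio = torch.exp(logp_new - logp_old)               # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return torch.min(unclipped, clipped).mean()          # surrogate to maximize
```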
arXiv Detail & Related papers (2022-01-31T20:26:56Z)
- On the Optimality of Batch Policy Optimization Algorithms [106.89498352537682]
Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment.
We show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral.
We introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.
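A minimal sketch of a confidence-adjusted index rule in the sense described above (generic illustration, not the paper's analysis; the sign of the adjustment determines optimistic, pessimistic, or neutral behavior):
```python
import numpy as np

def confidence_adjusted_index(mean_rewards, counts, alpha):
    """Index = empirical mean + alpha * confidence width.

    alpha > 0 gives an optimistic (UCB-style) index, alpha < 0 a pessimistic
    (LCB-style) one, and alpha = 0 a neutral plug-in estimate.
    """
    width = np.sqrt(1.0 / np.maximum(counts, 1))   # simple 1/sqrt(n) confidence width
    index = mean_rewards + alpha * width
    return int(np.argmax(index))                   # chosen action
```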
arXiv Detail & Related papers (2021-04-06T05:23:20Z)
- Distributionally-Constrained Policy Optimization via Unbalanced Optimal Transport [15.294456568539148]
We formulate policy optimization as unbalanced optimal transport over the space of occupancy measures.
We propose a general purpose RL objective based on Bregman divergence and optimize it using Dykstra's algorithm.
arXiv Detail & Related papers (2021-02-15T23:04:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.