Model Selection in Batch Policy Optimization
- URL: http://arxiv.org/abs/2112.12320v1
- Date: Thu, 23 Dec 2021 02:31:50 GMT
- Title: Model Selection in Batch Policy Optimization
- Authors: Jonathan N. Lee, George Tucker, Ofir Nachum, Bo Dai
- Abstract summary: We study the problem of model selection in batch policy optimization.
We identify three sources of error that any model selection algorithm should optimally trade off in order to be competitive.
- Score: 88.52887493684078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of model selection in batch policy optimization: given a
fixed, partial-feedback dataset and $M$ model classes, learn a policy with
performance that is competitive with the policy derived from the best model
class. We formalize the problem in the contextual bandit setting with linear
model classes by identifying three sources of error that any model selection
algorithm should optimally trade off in order to be competitive: (1)
approximation error, (2) statistical complexity, and (3) coverage. The first
two sources are common in model selection for supervised learning, where
optimally trading off these properties is well studied. In contrast, the third
source is unique to batch policy optimization and is due to dataset shift
inherent to the setting. We first show that no batch policy optimization
algorithm can achieve a guarantee addressing all three simultaneously,
revealing a stark contrast between difficulties in batch policy optimization
and the positive results available in supervised learning. Despite this
negative result, we show that relaxing any one of the three error sources
enables the design of algorithms achieving near-oracle inequalities for the
remaining two. We conclude with experiments demonstrating the efficacy of these
algorithms.
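To make the three-way trade-off concrete, here is a minimal illustrative sketch (an assumption for exposition, not the paper's algorithm): for each candidate linear model class, fit ridge regression on the logged data and score the class by a pessimistic value estimate whose elliptical penalty grows where the data poorly covers the greedy policy. Approximation error enters through the fit, statistical complexity through the class dimension, and coverage through the penalty; all names and constants below are illustrative.
```python
import numpy as np

rng = np.random.default_rng(0)


def fit_ridge(Phi, r, lam=1.0):
    """Ridge estimate of the reward parameters for one model class."""
    d = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(d)         # regularized data covariance
    return np.linalg.solve(A, Phi.T @ r), A


def pessimistic_value(per_context_feats, theta, A, beta):
    """Value of the greedy policy minus an elliptical coverage penalty."""
    vals = []
    for phi_all in per_context_feats:          # phi_all: (num_actions, d)
        scores = phi_all @ theta
        a = int(np.argmax(scores))
        phi = phi_all[a]
        bonus = beta * np.sqrt(phi @ np.linalg.solve(A, phi))
        vals.append(scores[a] - bonus)         # pessimism under poor coverage
    return float(np.mean(vals))


def feat(x, a, d):
    """Toy action-dependent feature map, truncated/tiled to dimension d."""
    return np.resize(x * (a + 1), d)


# Synthetic logged bandit data: contexts, behavior actions, noisy rewards.
n, num_actions, d_true = 500, 5, 4
X = rng.normal(size=(n, d_true))
theta_star = rng.normal(size=d_true)
actions = rng.integers(num_actions, size=n)
rewards = np.array([feat(X[i], actions[i], d_true) @ theta_star
                    for i in range(n)]) + 0.1 * rng.normal(size=n)

# Candidate model classes = feature maps of increasing dimension.
best = None
for d in [2, 4, 8]:
    Phi = np.array([feat(X[i], actions[i], d) for i in range(n)])
    theta, A = fit_ridge(Phi, rewards)
    feats = [np.array([feat(x, a, d) for a in range(num_actions)]) for x in X]
    v = pessimistic_value(feats, theta, A, beta=1.0)
    if best is None or v > best[1]:
        best = (d, v)
print("selected model class dimension:", best[0])
```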
Related papers
- Generalization Bounds of Surrogate Policies for Combinatorial Optimization Problems [61.580419063416734]
A recent stream of structured learning approaches has improved the practical state of the art for a range of optimization problems.
The key idea is to exploit the statistical distribution over instances instead of dealing with instances separately.
In this article, we investigate methods that smooth the risk by perturbing the policy, which eases optimization and reduces the generalization error.
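A toy illustration of the perturbation idea (an assumption about the mechanism, not the paper's construction): replacing a hard argmax policy over discrete solutions with an expectation under Gumbel-perturbed scores smooths the otherwise piecewise-constant risk.
```python
import numpy as np

rng = np.random.default_rng(1)


def perturbed_policy_cost(scores, cost, n_samples=1000, eps=0.5):
    """Monte Carlo estimate of E[cost(argmax(scores + eps * Gumbel))]."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gumbel(size=scores.shape)
        total += cost(int(np.argmax(scores + eps * z)))
    return total / n_samples


costs = np.array([3.0, 1.0, 2.0])            # per-solution cost
scores = np.array([0.2, 0.1, 0.0])           # model scores over solutions
smooth = perturbed_policy_cost(scores, lambda a: costs[a])
hard = costs[np.argmax(scores)]
print(f"hard-argmax cost {hard:.2f} vs smoothed cost {smooth:.2f}")
```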
arXiv Detail & Related papers (2024-07-24T12:00:30Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
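A minimal sketch of such a combined objective, assuming a DPO-style preference term (the paper's exact objective may differ): the SFT term on the preferred response acts as the regularizer against overoptimization.
```python
import torch
import torch.nn.functional as F


def preference_plus_sft_loss(logp_chosen, logp_rejected,
                             ref_logp_chosen, ref_logp_rejected,
                             beta=0.1, sft_weight=1.0):
    """Inputs are summed response log-probs under the policy / reference."""
    # Preference term on the implicit reward margin (DPO-style).
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    pref_loss = -F.logsigmoid(margin).mean()
    # SFT term: keep likelihood of the preferred responses high.
    sft_loss = -logp_chosen.mean()
    return pref_loss + sft_weight * sft_loss


# Toy usage with made-up log-probabilities for a batch of four pairs.
lp_c = torch.tensor([-10.0, -12.0, -9.0, -11.0], requires_grad=True)
lp_r = torch.tensor([-11.0, -11.5, -10.0, -13.0])
loss = preference_plus_sft_loss(lp_c, lp_r, lp_c.detach() - 0.5, lp_r - 0.2)
loss.backward()
print(float(loss))
```
With sft_weight set to zero this reduces to the plain preference loss; the added likelihood term is what anchors the policy to the preferred data.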
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Fine-Tuning Language Models with Reward Learning on Policy [68.70065254564642]
Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences.
Despite its popularity, (fixed) reward models may become inaccurate off-distribution as policy optimization shifts the data distribution.
We propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution.
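A hedged sketch of the loop structure only: the refine step below is a stand-in placeholder, and RLP's actual unsupervised objectives are not reproduced here.
```python
import numpy as np

rng = np.random.default_rng(2)


class ToyRewardModel:
    """Stand-in reward model: a fixed linear scorer over feature vectors."""
    def __init__(self, d):
        self.w = rng.normal(size=d)
        self.mean = np.zeros(d)               # running input statistics

    def score(self, x):
        return float(self.w @ (x - self.mean))

    def refine_on_policy_samples(self, xs):
        # Placeholder refinement: recenter on the policy's current sample
        # statistics so scores stay calibrated on-distribution.
        self.mean = xs.mean(axis=0)


def policy_samples(step, n=64, d=8):
    """The policy's output distribution drifts as training progresses."""
    return rng.normal(loc=0.05 * step, size=(n, d))


rm = ToyRewardModel(d=8)
for step in range(5):
    xs = policy_samples(step)
    rm.refine_on_policy_samples(xs)           # keep the RM on-distribution
    print(step, round(rm.score(xs[0]), 3))
```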
arXiv Detail & Related papers (2024-03-28T10:02:10Z)
- Experiment Planning with Function Approximation [49.50254688629728]
We study the problem of experiment planning with function approximation in contextual bandit problems.
We propose two experiment planning strategies compatible with function approximation.
We show that a uniform sampler achieves competitive optimality rates in the setting where the number of actions is small.
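A toy illustration of why a planned uniform sampler is competitive when the action set is small (illustrative only, not the paper's analysis): uniform logging covers every action, whereas a skewed logging policy may starve some arms.
```python
import numpy as np

rng = np.random.default_rng(3)
num_actions, n = 4, 2000
true_means = np.array([0.1, 0.5, 0.3, 0.2])   # action 1 is best


def collect(probs):
    """Gather a batch dataset under a fixed (planned) sampling policy."""
    a = rng.choice(num_actions, size=n, p=probs)
    r = rng.normal(true_means[a], 1.0)
    return a, r


def best_action_estimate(a, r):
    """Offline step: pick the action with the highest empirical mean."""
    means = [r[a == k].mean() if np.any(a == k) else -np.inf
             for k in range(num_actions)]
    return int(np.argmax(means))


uniform = np.full(num_actions, 1 / num_actions)
skewed = np.array([0.94, 0.02, 0.02, 0.02])    # barely covers actions 1-3
print("uniform plan picks action:", best_action_estimate(*collect(uniform)))
print("skewed plan picks action: ", best_action_estimate(*collect(skewed)))
```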
arXiv Detail & Related papers (2024-01-10T14:40:23Z)
- Probabilistic Bilevel Coreset Selection [24.874967723659022]
We propose a continuous probabilistic bilevel formulation of coreset selection by learning a probabilistic weight for each training sample.
We develop an efficient solver for the bilevel optimization problem via an unbiased policy gradient, avoiding the trouble of implicit differentiation.
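A minimal sketch of the score-function (policy-gradient) estimator behind this idea, with a one-shot weighted mean standing in for the inner training problem (purely illustrative):
```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.normal(size=n)
x[:20] += 8.0                                # corrupted samples we hope to drop
target = 0.0                                 # outer objective: subset mean near 0

p = np.full(n, 0.5)                          # Bernoulli inclusion probabilities
lr, baseline = 0.05, 0.0
for _ in range(500):
    mask = (rng.random(n) < p).astype(float) # sample a coreset
    subset = x[mask == 1]
    inner = subset.mean() if subset.size else 0.0   # "trained model" (toy)
    loss = (inner - target) ** 2             # outer/validation loss
    # Score-function (REINFORCE) gradient of E[loss] w.r.t. p -- no implicit
    # differentiation through the inner problem is needed.
    glogp = mask / np.clip(p, 1e-6, 1) - (1 - mask) / np.clip(1 - p, 1e-6, 1)
    p = np.clip(p - lr * (loss - baseline) * glogp, 0.01, 0.99)
    baseline = 0.9 * baseline + 0.1 * loss   # variance-reduction baseline
print("avg inclusion prob: corrupted", p[:20].mean().round(2),
      "clean", p[20:].mean().round(2))
```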
arXiv Detail & Related papers (2023-01-24T09:37:00Z)
- Oracle Inequalities for Model Selection in Offline Reinforcement Learning [105.74139523696284]
We study the problem of model selection in offline RL with value function approximation.
We propose the first model selection algorithm for offline RL that achieves minimax rate-optimal oracle inequalities up to logarithmic factors.
We conclude with several numerical simulations showing it is capable of reliably selecting a good model class.
arXiv Detail & Related papers (2022-11-03T17:32:34Z)
- Efficient Learning of Decision-Making Models: A Penalty Block Coordinate Descent Algorithm for Data-Driven Inverse Optimization [12.610576072466895]
We consider the inverse problem where we use prior decision data to uncover the underlying decision-making process.
This statistical learning problem is referred to as data-driven inverse optimization.
We propose an efficient block coordinate descent-based algorithm to solve large problem instances.
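A generic illustration of the penalty block coordinate descent pattern, alternating exact minimization over two blocks of a penalized objective; this shows the algorithmic template only, not the paper's inverse-optimization formulation.
```python
import numpy as np

rng = np.random.default_rng(5)
m, d = 50, 10
u_star = np.array([3.0, -2.0] + [0.0] * 8)    # sparse ground truth
A = rng.normal(size=(m, d))
b = A @ u_star + 0.1 * rng.normal(size=m)
lam, rho = 0.5, 10.0                          # sparsity and penalty weights

u, v = np.zeros(d), np.zeros(d)
AtA, Atb = A.T @ A, A.T @ b
for _ in range(100):
    # Block 1: min_u ||A u - b||^2 + rho ||u - v||^2  (closed-form solve).
    u = np.linalg.solve(AtA + rho * np.eye(d), Atb + rho * v)
    # Block 2: min_v lam ||v||_1 + rho ||u - v||^2  (soft-thresholding).
    v = np.sign(u) * np.maximum(np.abs(u) - lam / (2 * rho), 0.0)
print(np.round(v, 2))
```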
arXiv Detail & Related papers (2022-10-27T12:52:56Z)
- Cost-Effective Online Contextual Model Selection [14.094350329970537]
We formulate this task as an online contextual active model selection problem, where at each round the learner receives an unlabeled data point along with a context.
The goal is to output the best model for any given context without obtaining an excessive amount of labels.
We propose a contextual active model selection algorithm (CAMS), which relies on a novel uncertainty sampling query criterion defined on a given policy class for adaptive model selection.
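A hedged sketch in the spirit of CAMS (the query rule and constants are illustrative assumptions, not the paper's criterion): keep exponential weights over candidate models, query a label only when the weighted models disagree enough, and update the weights on queried labels.
```python
import numpy as np

rng = np.random.default_rng(6)
models = [lambda x: int(x[0] > 0),             # candidate model 1
          lambda x: int(x[1] > 0),             # candidate model 2
          lambda x: int(x[0] + x[1] > 0)]      # candidate model 3 (true one)
w = np.ones(len(models))                       # exponential weights
eta, queries = 1.0, 0

for _ in range(300):
    x = rng.normal(size=2)                     # context / unlabeled point
    y = int(x[0] + x[1] > 0)                   # label, hidden until queried
    preds = np.array([m(x) for m in models])
    probs = w / w.sum()
    majority = probs @ preds > 0.5             # weighted majority vote
    disagreement = probs @ (preds != majority).astype(float)
    if disagreement > 0.25:                    # uncertainty-based query rule
        queries += 1
        w = w * np.exp(-eta * (preds != y))    # penalize mistaken models
print("labels queried:", queries, "| selected model:", int(np.argmax(w)))
```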
arXiv Detail & Related papers (2022-07-13T08:22:22Z)
- Sample Selection for Fair and Robust Training [28.94276265328868]
We propose a sample selection-based algorithm for fair and robust training.
We show that our algorithm achieves fairness and robustness better than or comparable to the state-of-the-art technique.
arXiv Detail & Related papers (2021-10-27T07:17:29Z)
- Data-efficient Hindsight Off-policy Option Learning [20.42535406663446]
We introduce Hindsight Off-policy Options (HO2), a data-efficient option learning algorithm.
It robustly trains all policy components off-policy and end-to-end.
The approach outperforms existing option learning methods on common benchmarks.
arXiv Detail & Related papers (2020-07-30T16:52:33Z)