Offline Contextual Bandits with Overparameterized Models
- URL: http://arxiv.org/abs/2006.15368v4
- Date: Wed, 16 Jun 2021 16:15:32 GMT
- Title: Offline Contextual Bandits with Overparameterized Models
- Authors: David Brandfonbrener, William F. Whitney, Rajesh Ranganath, Joan Bruna
- Abstract summary: We ask whether the same phenomenon occurs for offline contextual bandits.
We show that this discrepancy is due to the action-stability of their objectives.
In experiments with large neural networks, this gap between action-stable value-based objectives and unstable policy-based objectives leads to significant performance differences.
- Score: 52.788628474552276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent results in supervised learning suggest that while overparameterized
models have the capacity to overfit, they in fact generalize quite well. We ask
whether the same phenomenon occurs for offline contextual bandits. Our results
are mixed. Value-based algorithms benefit from the same generalization behavior
as overparameterized supervised learning, but policy-based algorithms do not.
We show that this discrepancy is due to the \emph{action-stability} of their
objectives. An objective is action-stable if there exists a prediction
(action-value vector or action distribution) which is optimal no matter which
action is observed. While value-based objectives are action-stable,
policy-based objectives are unstable. We formally prove upper bounds on the
regret of overparameterized value-based learning and lower bounds on the regret
for policy-based algorithms. In our experiments with large neural networks,
this gap between action-stable value-based objectives and unstable policy-based
objectives leads to significant performance differences.
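To make the notion of action-stability concrete, here is a minimal numeric sketch (not taken from the paper's code; the reward values, propensities, and function names below are made up). It contrasts a per-sample value-based regression loss, which a single action-value vector minimizes no matter which action was logged, with a per-sample importance-weighted policy objective, whose maximizer flips to whichever action happened to be observed.

```python
import numpy as np

# Illustrative single-context example: two actions with fixed rewards and
# uniform logging propensities (all values are made up for this sketch).
true_rewards = np.array([0.7, 0.3])
behavior_prob = np.array([0.5, 0.5])

def value_loss(q, logged_action):
    """Per-sample value-based objective: regress q[a] onto the observed reward."""
    return (q[logged_action] - true_rewards[logged_action]) ** 2

def ips_policy_objective(pi, logged_action):
    """Per-sample inverse-propensity-scored policy objective (to be maximized)."""
    return pi[logged_action] * true_rewards[logged_action] / behavior_prob[logged_action]

# Value-based objective is action-stable: the single prediction q = true_rewards
# achieves zero loss regardless of which action was logged.
q_star = true_rewards
print([value_loss(q_star, a) for a in (0, 1)])  # [0.0, 0.0]

# Policy-based objective is not: with positive rewards, the per-sample maximizer
# puts all probability on whichever action happened to be logged.
deterministic_policies = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
for a in (0, 1):
    best = max(deterministic_policies, key=lambda pi: ips_policy_objective(pi, a))
    print(f"logged action {a} -> per-sample optimal policy {best}")
```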
Related papers
- Kernel Metric Learning for In-Sample Off-Policy Evaluation of Deterministic RL Policies [24.706986328622193]
We consider off-policy evaluation of deterministic target policies for reinforcement learning.
We learn the kernel metrics that minimize the overall mean squared error of the estimated temporal difference update vector of an action value function.
We derive the bias and variance of the estimation error due to this relaxation and provide analytic solutions for the optimal kernel metric.
arXiv Detail & Related papers (2024-05-29T06:17:33Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches for dealing with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned, but only their deterministic version is deployed.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Uncertainty-boosted Robust Video Activity Anticipation [72.14155465769201]
Video activity anticipation aims to predict what will happen in the future, with broad application prospects ranging from robot vision to autonomous driving.
Despite recent progress, the data uncertainty issue, reflected in the content evolution process and the dynamic correlation of event labels, has been largely overlooked.
We propose an uncertainty-boosted robust video activity anticipation framework, which generates uncertainty values to indicate the credibility of the anticipation results.
arXiv Detail & Related papers (2024-04-29T12:31:38Z)
- Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of Neu (2015) yields performance guarantees that are superior to all previous results in nearly every respect.
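As a point of reference, the "implicit exploration" estimator shifts the logging propensity in the denominator of the importance weight by a small constant, which caps the weights at the cost of a small, controlled bias. A minimal sketch under that reading (toy data, illustrative gamma_ix; not the paper's code):

```python
import numpy as np

def ix_value_estimate(rewards, actions, propensities, pi_target, gamma_ix=0.1):
    """Implicit-exploration (IX) value estimate of a target policy.

    Unlike vanilla inverse-propensity scoring, the propensity in the denominator
    is shifted by gamma_ix. All names and numbers here are illustrative.
    """
    weights = pi_target[np.arange(len(actions)), actions] / (propensities + gamma_ix)
    return float(np.mean(weights * rewards))

# Toy logged data: per round, the chosen action, its observed reward and propensity.
actions = np.array([0, 1, 0, 1])
rewards = np.array([1.0, 0.2, 0.8, 0.4])
propensities = np.array([0.5, 0.5, 0.9, 0.1])
pi_target = np.tile([[0.8, 0.2]], (4, 1))  # target policy's action probabilities

print(ix_value_estimate(rewards, actions, propensities, pi_target))
```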
arXiv Detail & Related papers (2023-09-27T16:42:10Z)
- Why Target Networks Stabilise Temporal Difference Methods [38.35578010611503]
We show that under mild regularity conditions and a well-tuned target-network update frequency, convergence can be guaranteed.
We conclude that the use of target networks can mitigate the effects of poor conditioning in the Jacobian of the TD update.
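For orientation, a minimal sketch of the mechanism being analysed: a semi-gradient TD(0) update with linear features in which the bootstrap target is computed from frozen weights that are copied from the online weights only every few steps. The features, step sizes, and copy period below are arbitrary illustrations, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, gamma, lr, copy_every = 8, 0.99, 0.05, 50

w = 0.1 * rng.normal(size=dim)  # online value-function weights
w_target = w.copy()             # frozen target-network weights

for step in range(1, 1001):
    # Stand-ins for state features, next-state features, and reward.
    phi, phi_next, reward = rng.normal(size=dim), rng.normal(size=dim), rng.normal()

    # The bootstrap target uses the *frozen* weights; this decoupling is the
    # stabilising effect under study.
    td_target = reward + gamma * (w_target @ phi_next)
    w += lr * (td_target - w @ phi) * phi

    if step % copy_every == 0:
        w_target = w.copy()     # periodic hard update of the target network
```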
arXiv Detail & Related papers (2023-02-24T09:46:00Z)
- Bridging the Gap Between Target Networks and Functional Regularization [61.051716530459586]
We propose an explicit Functional Regularization that is a convex regularizer in function space and can easily be tuned.
We analyze the convergence of our method theoretically and empirically demonstrate that replacing Target Networks with the more theoretically grounded Functional Regularization approach leads to better sample efficiency and performance improvements.
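One way to read this proposal (a sketch of the general idea, not the paper's implementation): bootstrap with the online network, and add an explicit convex penalty in function space that keeps predictions close to a lagging snapshot. With linear features and made-up constants:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, gamma, lr, kappa, lag_every = 8, 0.99, 0.05, 1.0, 50

w = 0.1 * rng.normal(size=dim)  # online value-function weights
w_lag = w.copy()                # lagging snapshot, used only inside the regulariser

for step in range(1, 1001):
    phi, phi_next, reward = rng.normal(size=dim), rng.normal(size=dim), rng.normal()

    # Semi-gradient TD error with the *online* weights in the bootstrap ...
    td_error = reward + gamma * (w @ phi_next) - w @ phi
    # ... plus the gradient of the functional penalty 0.5 * kappa * (Q - Q_lag)^2.
    reg_grad = kappa * ((w - w_lag) @ phi) * phi
    w += lr * (td_error * phi - reg_grad)

    if step % lag_every == 0:
        w_lag = w.copy()        # refresh the lagging snapshot
```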
arXiv Detail & Related papers (2022-10-21T22:27:07Z)
- Model-Free and Model-Based Policy Evaluation when Causality is Uncertain [7.858296711223292]
In off-policy evaluation, there may exist unobserved variables that both impact the dynamics and are used by the unknown behavior policy.
We develop worst-case bounds to assess sensitivity to these unobserved confounders in finite horizons.
We show that a model-based approach with robust MDPs gives sharper lower bounds by exploiting domain knowledge about the dynamics.
arXiv Detail & Related papers (2022-04-02T23:40:15Z)
- On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces [23.186300629667134]
We study the convergence of policy gradient algorithms under heavy-tailed parameterizations.
Our main theoretical contribution is establishing that this scheme converges with constant step and batch sizes.
arXiv Detail & Related papers (2022-01-28T18:54:30Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
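The mechanism behind never evaluating out-of-dataset actions in this method is expectile regression: a state-value function is fit to in-sample Q-values with an asymmetric squared loss, so the Bellman backup never queries unseen actions. A minimal sketch of that loss with made-up numbers:

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss; with tau > 0.5 it pushes V(s) toward an upper
    expectile of Q(s, a) over actions that actually appear in the dataset."""
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float(np.mean(weight * diff ** 2))

# Toy batch of in-sample Q-values and current V(s) predictions (illustrative).
q_sa = np.array([1.0, 0.2, 0.8, -0.1])  # Q(s, a) at logged (state, action) pairs
v_s = np.array([0.5, 0.5, 0.5, 0.5])

print(expectile_loss(q_sa - v_s))  # only dataset actions are ever queried
```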
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- Robust Value Iteration for Continuous Control Tasks [99.00362538261972]
When transferring a control policy from simulation to a physical system, the policy needs to be robust to variations in the dynamics to perform well.
We present Robust Fitted Value Iteration, which uses dynamic programming to compute the optimal value function on the compact state domain.
We show that robust value iteration is more robust than deep reinforcement learning algorithms and the non-robust version of the algorithm.
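A minimal sketch in the spirit of this description (not the paper's algorithm): value iteration over a small discrete model where each backup takes the worst case over a finite, made-up set of candidate transition models before maximising over actions.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, n_states, n_actions, n_models = 0.95, 3, 2, 4

rewards = np.array([[0.0, 1.0], [0.5, 0.0], [1.0, 0.2]])  # r(s, a), illustrative
# Uncertainty set: a few candidate transition models P[k, s, a, s'].
P = rng.dirichlet(np.ones(n_states), size=(n_models, n_states, n_actions))

V = np.zeros(n_states)
for _ in range(200):
    q_all = rewards[None] + gamma * (P @ V)    # shape: (models, states, actions)
    V = np.max(np.min(q_all, axis=0), axis=1)  # worst case over models, best action
print(V)
```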
arXiv Detail & Related papers (2021-05-25T19:48:35Z)