Beyond variance reduction: Understanding the true impact of baselines on
policy optimization
- URL: http://arxiv.org/abs/2008.13773v3
- Date: Fri, 19 Feb 2021 18:10:59 GMT
- Title: Beyond variance reduction: Understanding the true impact of baselines on
policy optimization
- Authors: Wesley Chung, Valentin Thomas, Marlos C. Machado, Nicolas Le Roux
- Abstract summary: Stochastic optimization theory traditionally predicts that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates.
We present theoretical results showing that, at least for bandit problems, curvature and noise are not sufficient to explain the learning dynamics.
- Score: 24.09670734037029
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bandit and reinforcement learning (RL) problems can often be framed as
optimization problems where the goal is to maximize average performance while
having access only to stochastic estimates of the true gradient. Traditionally,
stochastic optimization theory predicts that learning dynamics are governed by
the curvature of the loss function and the noise of the gradient estimates. In
this paper we demonstrate that this is not the case for bandit and RL problems.
To allow our analysis to be interpreted in light of multi-step MDPs, we focus
on techniques derived from stochastic optimization principles (e.g., natural
policy gradient and EXP3) and we show that some standard assumptions from
optimization theory are violated in these problems. We present theoretical
results showing that, at least for bandit problems, curvature and noise are not
sufficient to explain the learning dynamics and that seemingly innocuous
choices like the baseline can determine whether an algorithm converges. These
theoretical findings match our empirical evaluation, which we extend to
multi-state MDPs.
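To make the abstract's central point concrete, here is a minimal, hypothetical sketch (not code from the paper): REINFORCE on a two-armed bandit with a softmax policy. The arm means, learning rate, and `baseline` value are illustrative; the baseline leaves the expected gradient unchanged but alters the stochastic updates and hence the optimization path.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([1.0, 0.0])   # expected reward of each arm (arm 0 is optimal)
theta = np.zeros(2)            # softmax policy parameters
baseline = 0.0                 # try e.g. -5.0 or +5.0 and compare trajectories
lr = 0.1

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for step in range(5000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)
    r = means[a] + rng.normal()        # noisy reward
    # REINFORCE estimator with a baseline: (r - b) * grad log pi(a)
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += lr * (r - baseline) * grad_log_pi

print("final policy:", softmax(theta))  # mass on the optimal arm
```

For any constant baseline b the estimator stays unbiased, since E[grad log pi(a)] = 0; the paper's point is that the choice of b can nonetheless determine whether such updates converge to the optimal arm.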
Related papers
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization with Optimal Noise Scheduling [0.6906005491572401]
The paper provides theoretical insights on why training with large batch sizes falls into sharp local minima, why decaying learning rates and increasing batch sizes are superior to fixed learning rates, and what the optimal learning rate is.
arXiv Detail & Related papers (2023-11-15T07:27:40Z)
- Model-Based Reparameterization Policy Gradient Methods: Theory and Practical Algorithms [88.74308282658133]
Reparameterization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics.
Recent studies have revealed that, when applied to long-term reinforcement learning problems, model-based RP PGMs may experience chaotic and non-smooth optimization landscapes.
We propose a spectral normalization method to mitigate the exploding variance issue caused by long model unrolls (a generic sketch follows this entry).
arXiv Detail & Related papers (2023-10-30T18:43:21Z)
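As a hedged illustration of the mitigation mentioned above (generic spectral normalization in the style of Miyato et al., not necessarily this paper's exact procedure): dividing a weight matrix by an estimate of its largest singular value bounds its Lipschitz constant, which tames how perturbations grow across long model unrolls.

```python
import numpy as np

def spectral_normalize(W, n_iters=20, eps=1e-12):
    """Return W scaled to (approximately) unit spectral norm, via power iteration."""
    u = np.random.default_rng(0).normal(size=W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v) + eps
        u = W @ v
        u /= np.linalg.norm(u) + eps
    sigma = u @ W @ v            # estimate of the largest singular value
    return W / (sigma + eps)

W = np.random.default_rng(1).normal(size=(64, 64))
W_sn = spectral_normalize(W)
print(np.linalg.svd(W_sn, compute_uv=False)[0])  # ~1.0
```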
- Learning to Optimize with Stochastic Dominance Constraints [103.26714928625582]
In this paper, we develop a simple yet efficient approach for the problem of comparing uncertain quantities.
We recast inner optimization in the Lagrangian as a learning problem for surrogate approximation, which bypasses apparent intractability.
The proposed light-SD demonstrates superior performance on several representative problems ranging from finance to supply chain management.
arXiv Detail & Related papers (2022-11-14T21:54:31Z)
- Instance-Dependent Confidence and Early Stopping for Reinforcement Learning [99.57168572237421]
Various algorithms for reinforcement learning (RL) exhibit dramatic variation in their convergence rates as a function of problem structure.
This research provides guarantees that explain ex post the performance differences observed.
A natural next step is to convert these theoretical guarantees into guidelines that are useful in practice.
arXiv Detail & Related papers (2022-01-21T04:25:35Z)
- AdaTerm: Adaptive T-Distribution Estimated Robust Moments for Noise-Robust Stochastic Gradient Optimization [14.531550983885772]
We propose AdaTerm, a novel approach that incorporates the Student's t-distribution to derive not only the first-order moment but also all associated statistics.
This provides a unified treatment of the optimization process, offering a comprehensive framework under the statistical model of the t-distribution for the first time (a toy sketch of the down-weighting idea follows this entry).
arXiv Detail & Related papers (2022-01-18T03:13:19Z)
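A toy, hypothetical sketch of the core idea (not AdaTerm's actual update rule): under a Student-t model, a gradient sample that deviates strongly from the running mean receives a small weight (nu + 1)/(nu + z^2), so outlier gradients perturb the estimated moments far less than under a Gaussian (Adam-style) model.

```python
import numpy as np

def t_robust_moments(grads, nu=3.0, beta=0.9, eps=1e-8):
    """Exponential moving first/second moments with t-distribution down-weighting."""
    m, v = 0.0, 1.0
    for g in grads:
        z2 = (g - m) ** 2 / (v + eps)       # squared normalized deviation
        w = (nu + 1.0) / (nu + z2)          # t-model weight: small for outliers
        beta_eff = 1.0 - (1.0 - beta) * w   # outliers barely move the moments
        m = beta_eff * m + (1.0 - beta_eff) * g
        v = beta_eff * v + (1.0 - beta_eff) * (g - m) ** 2
    return m, v

grads = np.concatenate([np.random.default_rng(2).normal(0.5, 0.1, 100), [50.0]])
print(t_robust_moments(grads))  # the single outlier (50.0) is largely ignored
```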
- Optimizing Information-theoretical Generalization Bounds via Anisotropic Noise in SGLD [73.55632827932101]
We optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD.
We prove that, under the constraint of guaranteeing low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance (formalized after this entry).
arXiv Detail & Related papers (2021-10-26T15:02:27Z)
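In symbols, my reading of the claim above (with $\Sigma^{*}$ the optimal SGLD noise covariance and $g$ the stochastic gradient; scale factors omitted):

```latex
\[
  \Sigma^{*} \;\propto\; \operatorname{Cov}(g)^{1/2}
\]
```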
- Learning to Estimate Without Bias [57.82628598276623]
The Gauss-Markov theorem states that the weighted least squares estimator is the linear minimum variance unbiased estimator (MVUE) in linear models.
In this paper, we take a first step towards extending this result to nonlinear settings via deep learning with bias constraints (a sketch of such a loss follows this entry).
A second motivation for the bias-constrained estimator (BCE) is in applications where multiple estimates of the same unknown are averaged for improved performance.
arXiv Detail & Related papers (2021-10-24T10:23:51Z)
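A minimal, hypothetical sketch of a bias-constrained training loss (my reading, not the paper's exact formulation): on batches of noisy estimates that share the same unknown, penalize the squared mean error so the average estimate stays unbiased, in addition to the usual MSE.

```python
import numpy as np

def bias_constrained_loss(estimates, truth, lam=1.0):
    """MSE plus a penalty on the squared bias across estimates of one unknown."""
    err = estimates - truth
    mse = np.mean(err ** 2)
    bias_sq = np.mean(err) ** 2   # squared average error = empirical bias^2
    return mse + lam * bias_sq

# 32 noisy estimates of the same unknown value 2.0, biased upward by +0.3
est = 2.0 + 0.3 + np.random.default_rng(3).normal(0, 0.1, 32)
print(bias_constrained_loss(est, 2.0))
```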
- A theoretical and empirical study of new adaptive algorithms with additional momentum steps and shifted updates for stochastic non-convex optimization [0.0]
It is thought that adaptive optimization algorithms represent the key pillar behind the success of the Machine Learning field.
In this paper we introduce adaptive momentum techniques for different non-smooth objective problems.
arXiv Detail & Related papers (2021-10-16T09:47:57Z)
- Adaptive Approximate Policy Iteration [22.915651391812187]
We present a learning scheme which enjoys a $\tilde{O}(T^{2/3})$ regret bound for undiscounted, continuing learning in uniformly ergodic MDPs.
This is an improvement over the best existing bound of $\tilde{O}(T^{3/4})$ for the average-reward case with function approximation.
arXiv Detail & Related papers (2020-02-08T02:27:03Z)