On Double Descent in Reinforcement Learning with LSTD and Random
Features
- URL: http://arxiv.org/abs/2310.05518v4
- Date: Sun, 18 Feb 2024 10:34:45 GMT
- Title: On Double Descent in Reinforcement Learning with LSTD and Random
Features
- Authors: David Brellmann, Eloïse Berthier, David Filliat and Goran Frehse
- Abstract summary: Temporal Difference (TD) algorithms are widely used in Deep Reinforcement Learning (RL).
We present a theoretical analysis of the influence of network size and $l_2$-regularization on performance.
We observe a double descent phenomenon, i.e., a sudden drop in performance around the parameter/state ratio of one.
- Score: 1.5873758872998507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal Difference (TD) algorithms are widely used in Deep Reinforcement
Learning (RL). Their performance is heavily influenced by the size of the
neural network. While in supervised learning, the regime of
over-parameterization and its benefits are well understood, the situation in RL
is much less clear. In this paper, we present a theoretical analysis of the
influence of network size and $l_2$-regularization on performance. We identify
the ratio between the number of parameters and the number of visited states as
a crucial factor and define over-parameterization as the regime when it is
larger than one. Furthermore, we observe a double descent phenomenon, i.e., a
sudden drop in performance around the parameter/state ratio of one. Leveraging
random features and the lazy training regime, we study the regularized
Least-Square Temporal Difference (LSTD) algorithm in an asymptotic regime, as
both the number of parameters and states go to infinity, maintaining a constant
ratio. We derive deterministic limits of both the empirical and the true
Mean-Squared Bellman Error (MSBE) that feature correction terms responsible for
the double descent. Correction terms vanish when the $l_2$-regularization is
increased or the number of unvisited states goes to zero. Numerical experiments
with synthetic and small real-world environments closely match the theoretical
predictions.
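To make the setting concrete, here is a minimal numerical sketch of regularized LSTD with random features. It is illustrative only, not the paper's experimental code: the cyclic random-walk reward process, the ReLU random-feature map, the batch of $m = 40$ transitions, and the regularization value are all assumptions. The sketch solves a standard regularized-LSTD system $(\hat{A} + \lambda I)\theta = \hat{b}$ with $\hat{A} = \Phi^\top(\Phi - \gamma\Phi')/m$ and $\hat{b} = \Phi^\top r/m$, then reports the empirical and true MSBE as the parameter/state ratio $N/m$ sweeps through one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy Markov reward process (not from the paper): a cyclic
# random walk over n_states states with random per-state rewards.
n_states, gamma = 60, 0.95
P = np.zeros((n_states, n_states))
for s in range(n_states):
    P[s, (s - 1) % n_states] = P[s, (s + 1) % n_states] = 0.5
r_vec = rng.standard_normal(n_states)

# A batch of m observed transitions (s, r, s'); m plays the role of the
# number of visited states in the parameter/state ratio N / m.
m = 40
S = rng.integers(0, n_states, size=m)
S_next = np.array([rng.choice(n_states, p=P[s]) for s in S])
R = r_vec[S]

def relu_features(W, idx):
    """Random ReLU features phi(s) = max(W e_s, 0) for one-hot state inputs."""
    return np.maximum(W[:, idx], 0.0).T                     # shape (len(idx), N)

def lstd_msbe(N, l2=1e-6):
    """Regularized LSTD with N random features; returns empirical and true MSBE."""
    W = rng.standard_normal((N, n_states)) / np.sqrt(n_states)
    Phi, Phi_next = relu_features(W, S), relu_features(W, S_next)
    A = Phi.T @ (Phi - gamma * Phi_next) / m                # LSTD matrix
    b = Phi.T @ R / m
    theta = np.linalg.solve(A + l2 * np.eye(N), b)          # (A + l2*I) theta = b
    V_hat = relu_features(W, np.arange(n_states)) @ theta
    msbe_emp = np.mean((R + gamma * Phi_next @ theta - Phi @ theta) ** 2)
    msbe_true = np.mean((r_vec + gamma * P @ V_hat - V_hat) ** 2)
    return msbe_emp, msbe_true

# Sweep the parameter/state ratio N/m through one; with small l2 the true MSBE
# typically peaks near N/m = 1 and falls again in the over-parameterized regime.
for N in (8, 20, 36, 40, 44, 80, 200):
    msbe_emp, msbe_true = lstd_msbe(N)
    print(f"N/m = {N / m:5.2f}   empirical MSBE = {msbe_emp:10.4g}   true MSBE = {msbe_true:10.4g}")
```

With small $\lambda$, the true MSBE typically peaks near $N/m = 1$ and drops again in the over-parameterized regime, while the empirical MSBE approaches zero past the threshold; increasing $\lambda$ damps the peak, consistent with the correction terms vanishing under stronger $l_2$-regularization.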
Related papers
- A U-turn on Double Descent: Rethinking Parameter Counting in Statistical
Learning [68.76846801719095]
We reexamine when and where double descent appears, and show that its location is not inherently tied to the interpolation threshold p=n.
This provides a resolution to tensions between double descent and statistical intuition.
arXiv Detail & Related papers (2023-10-29T12:05:39Z)
- Kernel-based off-policy estimation without overlap: Instance optimality
beyond semiparametric efficiency [53.90687548731265]
We study optimal procedures for estimating a linear functional based on observational data.
For any convex and symmetric function class $\mathcal{F}$, we derive a non-asymptotic local minimax bound on the mean-squared error.
arXiv Detail & Related papers (2023-01-16T02:57:37Z)
- Learning Low Dimensional State Spaces with Overparameterized Recurrent
Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z)
- Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning.
GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients.
This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
arXiv Detail & Related papers (2022-06-08T21:32:50Z)
- Lassoed Tree Boosting [53.56229983630983]
We prove that a gradient boosted tree algorithm with early stopping achieves faster-than-$n^{-1/4}$ $L^2$ convergence in the large nonparametric space of càdlàg functions of bounded sectional variation.
Our convergence proofs are based on a novel, general theorem on early stopping with empirical loss minimizers of nested Donsker classes.
arXiv Detail & Related papers (2022-05-22T00:34:41Z)
- Understanding Square Loss in Training Overparametrized Neural Network
Classifiers [31.319145959402462]
We contribute to the theoretical understanding of square loss in classification by systematically investigating how it performs for overparametrized neural networks.
We consider two cases, according to whether classes are separable or not. In the general non-separable case, a fast convergence rate is established for both the misclassification rate and the calibration error.
The resulting margin is proven to be lower bounded away from zero, providing theoretical guarantees for robustness.
arXiv Detail & Related papers (2021-12-07T12:12:30Z)
- Neural Estimation of Statistical Divergences [24.78742908726579]
A modern method for estimating statistical divergences relies on parametrizing an empirical variational form by a neural network (NN); a minimal sketch of this variational idea appears after this list.
In particular, there is a fundamental tradeoff between the two sources of error involved: approximation and empirical estimation.
We show that neural estimators with a slightly different NN growth-rate are near minimax rate-optimal, achieving the parametric convergence rate up to logarithmic factors.
arXiv Detail & Related papers (2021-10-07T17:42:44Z)
- Nonasymptotic theory for two-layer neural networks: Beyond the
bias-variance trade-off [10.182922771556742]
We present a nonasymptotic generalization theory for two-layer neural networks with ReLU activation function.
We show that overparametrized random feature models suffer from the curse of dimensionality and thus are suboptimal.
arXiv Detail & Related papers (2021-06-09T03:52:18Z)
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
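As a complement to the Neural Estimation of Statistical Divergences entry above, here is a minimal sketch of the variational idea it refers to: the Donsker-Varadhan representation $\mathrm{KL}(P\|Q) \ge \mathbb{E}_P[f] - \log \mathbb{E}_Q[e^{f}]$, maximized over a parametrized critic $f$. The Gaussian test case, the random-ReLU-feature critic with only its output weights trained, and all hyperparameters are illustrative assumptions, not that paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two 1-D Gaussians whose KL divergence is known in closed form (ground truth).
mu_p, mu_q, sigma = 1.0, 0.0, 1.0
kl_true = (mu_p - mu_q) ** 2 / (2 * sigma ** 2)              # = 0.5 here

n = 2000
x_p = rng.normal(mu_p, sigma, n)                             # samples from P
x_q = rng.normal(mu_q, sigma, n)                             # samples from Q

# Critic f(x) = w . phi(x): random ReLU features with only the output weights
# trained -- a simplification of the neural critics used by such estimators.
N = 64
W, b = rng.standard_normal(N), rng.standard_normal(N)
phi = lambda x: np.maximum(np.outer(x, W) + b, 0.0) / np.sqrt(N)
F_p, F_q = phi(x_p), phi(x_q)                                # shape (n, N)
mean_p = F_p.mean(axis=0)

# Donsker-Varadhan bound: KL(P||Q) >= E_P[f] - log E_Q[exp f] for any critic f.
# The objective is concave in w, so plain gradient ascent converges.
w, lr = np.zeros(N), 2e-2
for _ in range(4000):
    fq = F_q @ w
    e = np.exp(fq - fq.max())                                # stabilized exp
    w += lr * (mean_p - (F_q * e[:, None]).sum(axis=0) / e.sum())

fq = F_q @ w
kl_hat = (F_p @ w).mean() - (np.log(np.exp(fq - fq.max()).mean()) + fq.max())
print(f"closed-form KL = {kl_true:.3f}, Donsker-Varadhan estimate = {kl_hat:.3f}")
```

Because the critic is linear in its trainable weights, the objective is concave and plain gradient ascent suffices here; a full neural critic would instead be trained by backpropagation through all layers.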