Sample Complexity and Overparameterization Bounds for Projection-Free
Neural TD Learning
- URL: http://arxiv.org/abs/2103.01391v1
- Date: Tue, 2 Mar 2021 01:05:19 GMT
- Title: Sample Complexity and Overparameterization Bounds for Projection-Free
Neural TD Learning
- Authors: Semih Cayci, Siddhartha Satpathi, Niao He, R. Srikant
- Abstract summary: Existing analysis of neural TD learning relies on either infinite-width analysis or constraining the network parameters to a (random) compact set.
We show that projection-free TD learning equipped with a two-layer ReLU network of any width exceeding $poly(\overline{\nu},1/\epsilon)$ converges to the true value function with error $\epsilon$ given $poly(\overline{\nu},1/\epsilon)$ iterations or samples.
- Score: 38.730333068555275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the dynamics of temporal-difference learning with neural
network-based value function approximation over a general state space, namely,
\emph{Neural TD learning}. Existing analysis of neural TD learning relies on
either infinite-width analysis or constraining the network parameters to a
(random) compact set; as a result, an extra projection step is required at each
iteration. This paper establishes a new convergence analysis of neural TD
learning \emph{without any projection}. We show that projection-free TD
learning equipped with a two-layer ReLU network of any width exceeding
$poly(\overline{\nu},1/\epsilon)$ converges to the true value function with
error $\epsilon$ given $poly(\overline{\nu},1/\epsilon)$ iterations or samples,
where $\overline{\nu}$ is an upper bound on the RKHS norm of the value function
induced by the neural tangent kernel. Our sample complexity and
overparameterization bounds are based on a drift analysis of the network
parameters as a stopped random process in the lazy training regime.
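To make the setup concrete, below is a minimal NumPy sketch of projection-free TD(0) with a two-layer ReLU network trained in the lazy regime: only the hidden-layer weights are updated, the output signs stay fixed at initialization, and no projection or truncation is applied between iterations. The environment interface `env_step`, the width `m`, the step size, and the initialization scheme are illustrative assumptions rather than details taken from the paper.
```python
import numpy as np

def init_network(m, d, rng):
    """Random hidden weights W (m x d) and fixed output signs a in {-1, +1}
    that are never updated (lazy-training style)."""
    W = rng.standard_normal((m, d)) / np.sqrt(d)
    a = rng.choice([-1.0, 1.0], size=m)
    return W, a

def value(W, a, s):
    """Two-layer ReLU value estimate V(s) = (1/sqrt(m)) * sum_i a_i * relu(w_i . s)."""
    m = W.shape[0]
    return a @ np.maximum(W @ s, 0.0) / np.sqrt(m)

def td_step(W, a, s, r, s_next, gamma, lr):
    """One projection-free semi-gradient TD(0) update of the hidden weights."""
    m = W.shape[0]
    delta = r + gamma * value(W, a, s_next) - value(W, a, s)  # TD error
    active = (W @ s > 0.0).astype(float)                      # ReLU gates at s
    grad = (a * active)[:, None] * s[None, :] / np.sqrt(m)    # dV(s)/dW
    return W + lr * delta * grad                              # note: no projection step

def neural_td(env_step, s0, m, d, gamma=0.99, lr=1e-2, n_iters=10_000, seed=0):
    """Run TD(0) along a trajectory generated by env_step(s) -> (r, s_next)."""
    rng = np.random.default_rng(seed)
    W, a = init_network(m, d, rng)
    s = s0
    for _ in range(n_iters):
        r, s_next = env_step(s)
        W = td_step(W, a, s, r, s_next, gamma, lr)
        s = s_next
    return W, a
```
In the paper's regime, a width $m$ and an iteration count both of order $poly(\overline{\nu},1/\epsilon)$ would suffice for an $\epsilon$-accurate value estimate; the sketch leaves these as free parameters.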
Related papers
- Preconditioned Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression [8.130817534654089]
We consider nonparametric regression by a two-layer neural network trained by gradient descent (GD) or its variant in this paper.
We show that, if the neural network is trained with a novel Preconditioned Gradient Descent (PGD) with early stopping and the target function has spectral bias widely studied in the deep learning literature, the trained network renders a particularly sharp generalization bound with a minimax optimal rate of $\mathcal{O}(1/n^{4\alpha/(4\alpha+1)})$.
arXiv Detail & Related papers (2024-07-16T03:38:34Z) - An Improved Finite-time Analysis of Temporal Difference Learning with Deep Neural Networks [11.925232472331494]
We develop an improved non-asymptotic analysis of the neural TD method with a general $L$-layer neural network.
New proof techniques are developed and an improved $\tilde{\mathcal{O}}(\epsilon^{-1})$ sample complexity is derived.
arXiv Detail & Related papers (2024-05-07T05:29:55Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - On the Performance of Temporal Difference Learning With Neural Networks [20.721853144434743]
Neural TD learning is an approximate temporal-difference method for policy evaluation that uses a neural network for function approximation.
We show an approximation bound of $O(\epsilon) + \tilde{O}(1/\sqrt{m})$, where $\epsilon$ is the approximation quality of the best neural network and $m$ is the network width.
arXiv Detail & Related papers (2023-12-08T22:34:29Z) - Beyond NTK with Vanilla Gradient Descent: A Mean-Field Analysis of
Neural Networks with Polynomial Width, Samples, and Time [37.73689342377357]
It is still an open question whether gradient descent on networks without unnatural modifications can achieve better sample complexity than kernel methods.
We show that projected gradient descent with a positive learning rate converges to low error with the same sample complexity.
arXiv Detail & Related papers (2023-06-28T16:45:38Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Deep neural network approximation of analytic functions [91.3755431537592]
We provide an entropy bound for the spaces of neural networks with piecewise linear activation functions.
We derive an oracle inequality for the expected error of the considered penalized deep neural network estimators.
arXiv Detail & Related papers (2021-04-05T18:02:04Z) - Regularization Matters: A Nonparametric Perspective on Overparametrized
Neural Network [20.132432350255087]
Overparametrized neural networks trained by gradient descent (GD) can provably overfit any training data.
This paper studies how well overparametrized neural networks can recover the true target function in the presence of random noises.
arXiv Detail & Related papers (2020-07-06T01:02:23Z) - Measuring Model Complexity of Neural Networks with Curve Activation
Functions [100.98319505253797]
We propose the linear approximation neural network (LANN) to approximate a given deep model with curve activation function.
We experimentally explore the training process of neural networks and detect overfitting.
We find that the $L_1$ and $L_2$ regularizations suppress the increase of model complexity.
arXiv Detail & Related papers (2020-06-16T07:38:06Z)