An Experimental Comparison Between Temporal Difference and Residual
Gradient with Neural Network Approximation
- URL: http://arxiv.org/abs/2205.12770v1
- Date: Wed, 25 May 2022 13:37:52 GMT
- Title: An Experimental Comparison Between Temporal Difference and Residual
Gradient with Neural Network Approximation
- Authors: Shuyu Yin, Tao Luo, Peilin Liu, Zhi-Qin John Xu
- Abstract summary: In deep Q-learning with neural network approximation, gradient descent is rarely used to solve the Bellman residual minimization problem.
In this work, we perform extensive experiments to show that Temporal Difference (TD) outperforms Residual Gradient (RG).
We also show empirically that the missing term in TD is a key reason why RG performs badly.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Gradient descent or its variants are popular in training neural networks.
However, in deep Q-learning with neural network approximation, a type of
reinforcement learning, gradient descent (also known as Residual Gradient (RG))
is rarely used to solve the Bellman residual minimization problem. Instead,
Temporal Difference (TD), an incomplete gradient descent method, prevails. In
this work, we perform extensive experiments to show that TD outperforms RG,
that is, when the training leads to a small Bellman residual error, the
solution found by TD has a better policy and is more robust against the
perturbation of neural network parameters. We further use experiments to reveal
a key difference between reinforcement learning and supervised learning, that
is, a small Bellman residual error can correspond to a bad policy in
reinforcement learning while the test loss function in supervised learning is a
standard index of performance. We also show empirically that the missing term
in TD is a key reason why RG performs badly. Our work shows that the
performance of a deep Q-learning solution is closely related to the training
dynamics; how an incomplete gradient descent method can find a good policy is
an interesting question for future study.
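The TD/RG distinction described above can be made concrete in a minimal sketch. Assuming a hypothetical linear Q-function Q(s, a) = w · φ(s, a), the full gradient of the squared Bellman residual (RG) differentiates through the bootstrapped target as well; TD uses a "semi-gradient" that treats the target as a constant, dropping one term — the "missing term" the abstract refers to:

```python
import numpy as np

# Toy illustration (hypothetical setup): linear Q(s, a) = w @ phi(s, a).
rng = np.random.default_rng(0)
n_features, gamma, lr = 4, 0.9, 0.1
w = rng.normal(size=n_features)

phi_sa = rng.normal(size=n_features)    # features of (s, a)
phi_next = rng.normal(size=n_features)  # features of (s', argmax_a' Q)
r = 1.0                                 # observed reward

# Bellman residual: delta = r + gamma * Q(s', a') - Q(s, a)
delta = r + gamma * (w @ phi_next) - w @ phi_sa

# RG: true gradient of 0.5 * delta**2 w.r.t. w (target depends on w too)
grad_rg = delta * (gamma * phi_next - phi_sa)
w_rg = w - lr * grad_rg

# TD: semi-gradient; the target r + gamma * Q(s', a') is held constant,
# so the gamma * phi_next term is dropped -- the "missing term" in TD.
grad_td = delta * (-phi_sa)
w_td = w - lr * grad_td

# The two updates differ exactly by the missing-term contribution.
assert np.allclose(w_rg - w_td, -lr * delta * gamma * phi_next)
```

The assertion makes the point of the paper's comparison explicit: TD and RG agree up to a single term proportional to the gradient through the next-state value, and the experiments above attribute RG's poor policies largely to that term.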
Related papers
- Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - A Framework for Provably Stable and Consistent Training of Deep
Feedforward Networks [4.21061712600981]
We present a novel algorithm for training deep neural networks in supervised (classification and regression) and unsupervised (reinforcement learning) scenarios.
This algorithm combines standard gradient descent with gradient clipping.
We show, in theory and through experiments, that our algorithm updates have low variance, and the training loss reduces in a smooth manner.
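For readers unfamiliar with the clipping step, a minimal sketch of generic global-norm gradient clipping combined with a descent update follows; this illustrates only the standard technique the paper builds on, not the paper's specific algorithm, and all names here are illustrative:

```python
import numpy as np

def clip_by_global_norm(grad, max_norm):
    """Scale the gradient down if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

def sgd_step(w, grad, lr=0.1, max_norm=1.0):
    # Descent update with the clipped gradient: the step length is
    # bounded by lr * max_norm, which keeps updates low-variance.
    return w - lr * clip_by_global_norm(grad, max_norm)

w = np.zeros(2)
big_grad = np.array([30.0, 40.0])   # norm 50, clipped to norm 1
w_new = sgd_step(w, big_grad)
assert np.isclose(np.linalg.norm(w_new - w), 0.1)  # lr * max_norm
```

Bounding the step size this way is one simple mechanism for the smooth, low-variance loss decrease the abstract describes.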
arXiv Detail & Related papers (2023-05-20T07:18:06Z) - Training a Two Layer ReLU Network Analytically [4.94950858749529]
We will explore an algorithm for training two-layer neural networks with ReLU-like activation and the square loss.
The method is faster than gradient descent methods and has virtually no tuning parameters.
arXiv Detail & Related papers (2023-04-06T09:57:52Z) - Alternate Loss Functions for Classification and Robust Regression Can Improve the Accuracy of Artificial Neural Networks [6.452225158891343]
This paper shows that training speed and final accuracy of neural networks can significantly depend on the loss function used to train neural networks.
Two new classification loss functions that significantly improve performance on a wide variety of benchmark tasks are proposed.
arXiv Detail & Related papers (2023-03-17T12:52:06Z) - Benign Overfitting for Two-layer ReLU Convolutional Neural Networks [60.19739010031304]
We establish algorithm-dependent risk bounds for learning two-layer ReLU convolutional neural networks with label-flipping noise.
We show that, under mild conditions, the neural network trained by gradient descent can achieve near-zero training loss and Bayes optimal test risk.
arXiv Detail & Related papers (2023-03-07T18:59:38Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems.
However, PINNs can be trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Theoretical Characterization of How Neural Network Pruning Affects its
Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z) - Learning Lipschitz Functions by GD-trained Shallow Overparameterized
ReLU Neural Networks [12.018422134251384]
We show that neural networks trained to nearly zero training error are inconsistent in this class.
We show that whenever some early stopping rule is guaranteed to give an optimal rate (of excess risk) on the Hilbert space of the kernel induced by the ReLU activation function, the same rule can be used to achieve minimax optimal rate.
arXiv Detail & Related papers (2022-12-28T14:56:27Z) - Early Stage Convergence and Global Convergence of Training Mildly
Parameterized Neural Networks [3.148524502470734]
We show that the loss is decreased by a significant amount in the early stage of the training, and this decrease is fast.
We use a microscopic analysis of the activation patterns for the neurons, which helps us derive more powerful lower bounds for the gradient.
arXiv Detail & Related papers (2022-06-05T09:56:50Z) - The Break-Even Point on Optimization Trajectories of Deep Neural
Networks [64.7563588124004]
We argue for the existence of the "break-even" point on this trajectory.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.