Overcoming the Spectral Bias of Neural Value Approximation
- URL: http://arxiv.org/abs/2206.04672v1
- Date: Thu, 9 Jun 2022 17:59:57 GMT
- Title: Overcoming the Spectral Bias of Neural Value Approximation
- Authors: Ge Yang, Anurag Ajay, Pulkit Agrawal
- Abstract summary: Value approximation using deep neural networks is often the primary module that provides learning signals to the rest of the algorithm.
Recent works in neural kernel regression suggest the presence of a spectral bias, where fitting high-frequency components of the value function requires exponentially more gradient update steps than the low-frequency ones.
We re-examine off-policy reinforcement learning through the lens of kernel regression and propose to overcome such bias via a composite neural kernel.
- Score: 17.546011419043644
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Value approximation using deep neural networks is at the heart of off-policy
deep reinforcement learning, and is often the primary module that provides
learning signals to the rest of the algorithm. While multi-layer perceptron
networks are universal function approximators, recent works in neural kernel
regression suggest the presence of a spectral bias, where fitting
high-frequency components of the value function requires exponentially more
gradient update steps than the low-frequency ones. In this work, we re-examine
off-policy reinforcement learning through the lens of kernel regression and
propose to overcome such bias via a composite neural tangent kernel. With just
a single line-change, our approach, Fourier feature networks (FFN), produces
state-of-the-art performance on challenging continuous control domains with
only a fraction of the compute. Faster convergence and better off-policy
stability also make it possible to remove the target network without suffering
catastrophic divergences, which further reduces TD(0)'s estimation bias on a
few tasks.
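In NTK terms, gradient descent fits the component of the target along the kernel's i-th eigendirection at a rate of roughly (1 - η λ_i)^t, so directions with small eigenvalues (the high-frequency ones) need on the order of 1/λ_i times more updates; composing the network with a Fourier feature mapping reshapes this spectrum. The sketch below shows one way the single line-change could look in a PyTorch critic: a random Fourier feature layer in front of an otherwise standard Q-network MLP. The feature width, scale, and whether the projection matrix is trained are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class FourierFeatures(nn.Module):
    """Map x to [cos(2*pi*x B), sin(2*pi*x B)] with a Gaussian projection B.

    The feature width, scale, and trainability of B are illustrative
    assumptions, not the paper's exact hyperparameters.
    """
    def __init__(self, in_dim: int, n_features: int = 256,
                 scale: float = 1.0, learnable: bool = True):
        super().__init__()
        B = torch.randn(in_dim, n_features) * scale
        self.B = nn.Parameter(B, requires_grad=learnable)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        proj = 2 * torch.pi * x @ self.B
        return torch.cat([torch.cos(proj), torch.sin(proj)], dim=-1)


class Critic(nn.Module):
    """Q(s, a) network; the only change from a vanilla MLP critic is the
    Fourier feature layer applied to the concatenated state-action input."""
    def __init__(self, obs_dim: int, act_dim: int,
                 hidden: int = 256, n_features: int = 256):
        super().__init__()
        self.ff = FourierFeatures(obs_dim + act_dim, n_features)
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.ff(torch.cat([obs, act], dim=-1)))


# Usage: q = Critic(obs_dim=17, act_dim=6); q(obs_batch, act_batch) -> (B, 1)
```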
Related papers
- Deep Learning without Global Optimization by Random Fourier Neural Networks [0.0]
We introduce a new training algorithm for a variety of deep neural networks that utilize random complex exponential activation functions.
Our approach employs a Markov Chain Monte Carlo sampling procedure to iteratively train network layers.
It consistently attains the theoretical approximation rate for residual networks with complex exponential activation functions.
arXiv Detail & Related papers (2024-07-16T16:23:40Z) - Neural Network-Based Score Estimation in Diffusion Models: Optimization
and Generalization [12.812942188697326]
Diffusion models have emerged as a powerful tool rivaling GANs in generating high-quality samples with improved fidelity, flexibility, and robustness.
A key component of these models is to learn the score function through score matching.
Despite empirical success on various tasks, it remains unclear whether gradient-based algorithms can learn the score function with provable accuracy (see the score-matching sketch after this list).
arXiv Detail & Related papers (2024-01-28T08:13:56Z) - Multi-stage Neural Networks: Function Approximator of Machine Precision [0.456877715768796]
We develop multi-stage neural networks that reduce prediction errors below $O(10^{-16})$ with large network size and extended training iterations.
We demonstrate that the prediction error from the multi-stage training for both regression problems and physics-informed neural networks can nearly reach the machine precision $O(10^{-16})$ of double-precision floating point within a finite number of iterations (a generic residual-fitting sketch of multi-stage training appears after this list).
arXiv Detail & Related papers (2023-07-18T02:47:32Z) - A Scalable Walsh-Hadamard Regularizer to Overcome the Low-degree
Spectral Bias of Neural Networks [79.28094304325116]
Despite the capacity of neural nets to learn arbitrary functions, models trained through gradient descent often exhibit a bias towards "simpler" functions.
We show how this spectral bias towards low-degree frequencies can in fact hurt the neural network's generalization on real-world datasets.
We propose a new scalable functional regularization scheme that helps the neural network learn higher-degree frequencies.
arXiv Detail & Related papers (2023-05-16T20:06:01Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - Learning Frequency Domain Approximation for Binary Neural Networks [68.79904499480025]
We propose to estimate the gradient of the sign function in the Fourier frequency domain using a combination of sine functions for training BNNs (see the Fourier-series surrogate sketch after this list).
Experiments on several benchmark datasets and neural architectures illustrate that the binary network learned using our method achieves state-of-the-art accuracy.
arXiv Detail & Related papers (2021-03-01T08:25:26Z) - Stable Low-rank Tensor Decomposition for Compression of Convolutional
Neural Network [19.717842489217684]
This paper is the first study on degeneracy in the tensor decomposition of convolutional kernels.
We present a novel method, which can stabilize the low-rank approximation of convolutional kernels and ensure efficient compression.
We evaluate our approach on popular CNN architectures for image classification and show that our method results in much lower accuracy degradation and provides consistent performance.
arXiv Detail & Related papers (2020-08-12T17:10:12Z) - Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z) - Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z)