How do noise tails impact on deep ReLU networks?
- URL: http://arxiv.org/abs/2203.10418v1
- Date: Sun, 20 Mar 2022 00:27:32 GMT
- Title: How do noise tails impact on deep ReLU networks?
- Authors: Jianqing Fan, Yihong Gu, Wen-Xin Zhou
- Abstract summary: We show how the optimal rate of convergence depends on p, the degree of smoothness and the intrinsic dimension in a class of nonparametric regression functions.
We also contribute some new results on the approximation theory of deep ReLU neural networks.
- Score: 2.5889847253961418
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates the stability of deep ReLU neural networks for
nonparametric regression under the assumption that the noise has only a finite
p-th moment. We unveil how the optimal rate of convergence depends on p, the
degree of smoothness and the intrinsic dimension in a class of nonparametric
regression functions with hierarchical composition structure when both the
adaptive Huber loss and deep ReLU neural networks are used. This optimal rate
of convergence cannot be obtained by ordinary least squares, but can be
achieved by the Huber loss with a properly chosen parameter that adapts to the
sample size, smoothness, and moment parameters. A concentration inequality for
the adaptive Huber ReLU neural network estimators with allowable optimization
errors is also derived. To establish a matching lower bound within the class of
neural network estimators using the Huber loss, we employ a different strategy
from the traditional route: constructing a deep ReLU network estimator that has
a smaller empirical loss than the true function; the difference between these
two functions then furnishes a lower bound. This step is related to the Huberization
bias, yet more critically to the approximability of deep ReLU networks. As a
result, we also contribute some new results on the approximation theory of deep
ReLU neural networks.
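The adaptive Huber loss at the heart of the abstract can be sketched in a few lines of NumPy. The robustification parameter `tau` caps the influence of large residuals; the `adaptive_tau` scaling below is an illustrative placeholder, not the paper's actual rate (which also involves the smoothness):

```python
import numpy as np

def huber_loss(residuals, tau):
    """Huber loss: quadratic for |r| <= tau, linear (robust) beyond tau."""
    r = np.abs(residuals)
    return np.where(r <= tau, 0.5 * r ** 2, tau * (r - 0.5 * tau))

def adaptive_tau(n, p, c=1.0):
    """Illustrative robustification parameter growing with sample size n for
    noise with a finite p-th moment. Assumption: this n**(1/p) scaling is a
    placeholder for exposition, not the rate derived in the paper."""
    return c * n ** (1.0 / p)
```

As `tau` grows to infinity the loss reduces to least squares; keeping it finite bounds the influence of heavy-tailed residuals, which is why a properly tuned Huber loss can attain the optimal rate where ordinary least squares cannot.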
Related papers
- Benign Overfitting for Regression with Trained Two-Layer ReLU Networks [14.36840959836957]
We study the least-squares regression problem with a two-layer fully-connected neural network, with ReLU activation function, trained by gradient flow.
Our first result is a generalization result, that requires no assumptions on the underlying regression function or the noise other than that they are bounded.
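The setup in this entry, a two-layer fully-connected ReLU network fit by gradient flow on a least-squares objective, can be sketched with an Euler discretization (plain gradient descent). The width, step size, and toy target below are arbitrary illustrative choices, not from the paper:

```python
import numpy as np

def train_two_layer_relu(steps=2000, width=16, lr=0.05, seed=0):
    """Fit f(x) = sum_j a_j * relu(w_j . [x, 1]) by gradient descent on the
    mean squared error (a discretized gradient flow); returns the MSE before
    and after training."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([rng.uniform(-1, 1, 32), np.ones(32)])  # input + bias
    y = np.sin(2.0 * X[:, 0])                                   # toy target
    W = rng.normal(0.0, 1.0, (width, 2))   # hidden-layer weights
    a = rng.normal(0.0, 0.1, width)        # output-layer weights
    mse = lambda: np.mean((np.maximum(X @ W.T, 0) @ a - y) ** 2)
    mse_init = mse()
    for _ in range(steps):
        H = np.maximum(X @ W.T, 0)                   # hidden ReLU features
        err = H @ a - y
        grad_a = (H.T @ err) / len(y)                # output-layer gradient
        grad_W = ((err[:, None] * (H > 0)) * a).T @ X / len(y)
        a -= lr * grad_a
        W -= lr * grad_W
    return mse_init, mse()
```

Decreasing the step size toward zero recovers the gradient-flow dynamics the paper analyzes.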
arXiv Detail & Related papers (2024-10-08T16:54:23Z) - Equidistribution-based training of Free Knot Splines and ReLU Neural Networks [0.0]
We show that the $L$-based approximation problem is ill-conditioned for shallow neural networks (NNs) with a rectified linear unit (ReLU) activation function.
We propose a two-level procedure for training the FKS by first solving the nonlinear problem of finding the optimal knot locations.
We then determine the optimal weights and knots of the FKS by solving a nearly linear, well-conditioned problem.
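The second stage of the two-level procedure above, fitting the weights once the knots are fixed, is indeed a linear least-squares problem. A sketch using piecewise-linear "hat" basis functions; the helper names are illustrative, not from the paper:

```python
import numpy as np

def hat_basis(x, knots):
    """Design matrix of piecewise-linear hat functions: column j equals 1 at
    knots[j], 0 at the neighboring knots, and is linear in between."""
    B = np.zeros((x.size, knots.size))
    for j in range(knots.size):
        e = np.zeros(knots.size)
        e[j] = 1.0
        B[:, j] = np.interp(x, knots, e)
    return B

def fit_fks_weights(x, y, knots):
    """With the knot locations fixed, the optimal spline weights solve a
    well-conditioned linear least-squares problem."""
    w, *_ = np.linalg.lstsq(hat_basis(x, knots), y, rcond=None)
    return w
```

With knots at (0, 0.5, 1), the kink function |x - 0.5| is reproduced exactly; the hard, nonlinear part of the problem is choosing where the knots go, which is what the first stage addresses.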
arXiv Detail & Related papers (2024-07-02T10:51:36Z) - Semi-Supervised Deep Sobolev Regression: Estimation and Variable Selection by ReQU Neural Network [3.4623717820849476]
We propose SDORE, a Semi-supervised Deep Sobolev Regressor, for the nonparametric estimation of the underlying regression function and its gradient.
Our study includes a thorough analysis of the convergence rates of SDORE in the $L_2$-norm, achieving minimax optimality.
arXiv Detail & Related papers (2024-01-09T13:10:30Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Optimal rates of approximation by shallow ReLU$^k$ neural networks and applications to nonparametric regression [12.21422686958087]
We study the approximation capacity of some variation spaces corresponding to shallow ReLU$^k$ neural networks.
For functions with less smoothness, the approximation rates in terms of the variation norm are established.
We show that shallow neural networks can achieve the minimax optimal rates for learning Hölder functions.
arXiv Detail & Related papers (2023-04-04T06:35:02Z) - Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Jensen-Shannon Divergence Based Novel Loss Functions for Bayesian Neural Networks [2.4554686192257424]
We formulate a novel loss function for BNNs based on a new modification to the generalized Jensen-Shannon (JS) divergence, which is bounded.
We find that the JS divergence-based variational inference is intractable, and hence employ a constrained optimization framework to formulate these losses.
Our theoretical analysis and empirical experiments on multiple regression and classification data sets suggest that the proposed losses perform better than the KL divergence-based loss, especially when the data sets are noisy or biased.
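The boundedness property this entry relies on can be seen in the standard skew-generalized JS divergence. The paper's actual modification may differ; this is the textbook form, sketched for discrete distributions:

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions, with 0*log(0) := 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def gjs(p, q, alpha=0.5):
    """Skew-generalized Jensen-Shannon divergence: both KL terms are taken
    against the mixture m, so the result stays bounded even when
    KL(p || q) itself is infinite."""
    m = alpha * p + (1.0 - alpha) * q
    return alpha * kl(p, m) + (1.0 - alpha) * kl(q, m)
```

For distributions with disjoint supports KL(p || q) diverges, while gjs stays at its maximum of log 2 (for alpha = 0.5); this boundedness is what makes a JS-based loss robust on noisy or biased data.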
arXiv Detail & Related papers (2022-09-23T01:47:09Z) - Optimization-Based Separations for Neural Networks [57.875347246373956]
We show that gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations.
This is the first optimization-based separation result where the approximation benefits of the stronger architecture provably manifest in practice.
arXiv Detail & Related papers (2021-12-04T18:07:47Z) - Being Bayesian, Even Just a Bit, Fixes Overconfidence in ReLU Networks [65.24701908364383]
We show that a sufficient condition for calibrated uncertainty on a ReLU network is "to be a bit Bayesian".
We further validate these findings empirically via various standard experiments using common deep ReLU networks and Laplace approximations.
arXiv Detail & Related papers (2020-02-24T08:52:06Z) - Beyond Dropout: Feature Map Distortion to Regularize Deep Neural Networks [107.77595511218429]
In this paper, we investigate the empirical Rademacher complexity related to intermediate layers of deep neural networks.
We propose a feature distortion method (Disout) for addressing the aforementioned problem.
The superiority of the proposed feature map distortion for producing deep neural networks with higher testing performance is analyzed and demonstrated.
arXiv Detail & Related papers (2020-02-23T13:59:13Z) - MSE-Optimal Neural Network Initialization via Layer Fusion [68.72356718879428]
Deep neural networks achieve state-of-the-art performance for a range of classification and inference tasks.
The use of gradient-based training combined with nonconvexity renders learning susceptible to novel problems.
We propose fusing neighboring layers of deeper networks that are trained with random variables.
arXiv Detail & Related papers (2020-01-28T18:25:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.