Ridge Regression with Over-Parametrized Two-Layer Networks Converge to
Ridgelet Spectrum
- URL: http://arxiv.org/abs/2007.03441v2
- Date: Fri, 19 Feb 2021 07:28:27 GMT
- Title: Ridge Regression with Over-Parametrized Two-Layer Networks Converge to
Ridgelet Spectrum
- Authors: Sho Sonoda, Isao Ishikawa, Masahiro Ikeda
- Abstract summary: We show that the distribution of the parameters converges to a spectrum of the ridgelet transform.
This result provides a new insight into the characterization of the local minima of neural networks.
- Score: 10.05944106581306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Characterization of local minima draws much attention in theoretical studies
of deep learning. In this study, we investigate the distribution of parameters
in an over-parametrized finite neural network trained by ridge regularized
empirical square risk minimization (RERM). We develop a new theory of ridgelet
transform, a wavelet-like integral transform that provides a powerful and
general framework for the theoretical study of neural networks involving not
only the ReLU but general activation functions. We show that the distribution
of the parameters converges to a spectrum of the ridgelet transform. This
result provides a new insight into the characterization of the local minima of
neural networks, and the theoretical background of an inductive bias theory
based on lazy regimes. Through numerical experiments with finite models, we
confirm the visual resemblance between the parameter distribution trained by
SGD and the ridgelet spectrum calculated by numerical integration.
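The comparison described in the abstract can be sketched numerically. The following is a minimal 1-D illustration, not the authors' code: the hidden parameters (a_j, b_j) of an over-parametrized two-layer network f(x) = sum_j c_j sigma(a_j x - b_j) are drawn at random and frozen (a lazy-regime simplification), the output weights c are fitted by ridge regression, and c_j plotted over (a_j, b_j) is compared with the ridgelet transform R[f](a, b) = \int f(x) psi(a x - b) dx evaluated by numerical integration. The target function, the tanh activation, the ridgelet function psi, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
f_target = lambda x: np.sin(np.pi * x)     # toy target, an arbitrary choice

# Training data on [-1, 1]
x_train = rng.uniform(-1.0, 1.0, size=300)
y_train = f_target(x_train)

# Over-parametrized hidden layer with randomly drawn, frozen (a_j, b_j)
width = 5000
a = rng.uniform(-8.0, 8.0, size=width)
b = rng.uniform(-8.0, 8.0, size=width)
Phi = np.tanh(np.outer(x_train, a) - b)    # feature matrix, shape (n, width)

# Ridge regression for the output weights c (dual closed form)
lam = 1e-3
c = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(len(x_train)), y_train)

# Ridgelet transform of the target by numerical integration over x;
# psi is a simple oscillating choice, not necessarily admissible for tanh.
xs = np.linspace(-1.0, 1.0, 500)
psi = lambda z: (1.0 - z**2) * np.exp(-0.5 * z**2)
spectrum = np.trapz(f_target(xs) * psi(np.outer(a, xs) - b[:, None]), xs, axis=1)

# Informal check of the claimed resemblance: c_j over (a_j, b_j) vs. R[f](a_j, b_j).
corr = np.corrcoef(c, spectrum)[0, 1]
print(f"correlation between fitted c_j and ridgelet spectrum: {corr:.3f}")
```

Plotting a scatter of (a_j, b_j) colored by c_j next to a heat map of R[f](a, b) reproduces the kind of visual comparison the abstract refers to.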
Related papers
- Novel Kernel Models and Exact Representor Theory for Neural Networks Beyond the Over-Parameterized Regime [52.00917519626559]
This paper presents two models of neural-networks and their training applicable to neural networks of arbitrary width, depth and topology.
We also present an exact novel representor theory for layer-wise neural network training with unregularized gradient descent in terms of a local-extrinsic neural kernel (LeNK).
This representor theory gives insight into the role of higher-order statistics in neural network training and the effect of kernel evolution in neural-network kernel models.
arXiv Detail & Related papers (2024-05-24T06:30:36Z) - Generalization of Scaled Deep ResNets in the Mean-Field Regime [55.77054255101667]
We investigate scaled ResNet in the limit of infinitely deep and wide neural networks.
Our results offer new insights into the generalization ability of deep ResNet beyond the lazy training regime.
arXiv Detail & Related papers (2024-03-14T21:48:00Z) - Learning Theory of Distribution Regression with Neural Networks [6.961253535504979]
We establish an approximation theory and a learning theory of distribution regression via a fully connected neural network (FNN).
In contrast to the classical regression methods, the input variables of distribution regression are probability measures.
arXiv Detail & Related papers (2023-07-07T09:49:11Z) - Neural Characteristic Activation Analysis and Geometric Parameterization for ReLU Networks [2.2713084727838115]
We introduce a novel approach for analyzing the training dynamics of ReLU networks by examining the characteristic activation boundaries of individual neurons.
Our proposed analysis reveals a critical instability in common neural network parameterizations and normalizations during optimization, which impedes fast convergence and hurts performance.
arXiv Detail & Related papers (2023-05-25T10:19:13Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Nonasymptotic theory for two-layer neural networks: Beyond the
bias-variance trade-off [10.182922771556742]
We present a nonasymptotic generalization theory for two-layer neural networks with ReLU activation function.
We show that overparametrized random feature models suffer from the curse of dimensionality and thus are suboptimal.
arXiv Detail & Related papers (2021-06-09T03:52:18Z) - Training Sparse Neural Network by Constraining Synaptic Weight on Unit
Lp Sphere [2.429910016019183]
Constraining the synaptic weights to the unit Lp-sphere enables flexible control of the sparsity via p; a minimal illustrative sketch of this constraint appears after this list.
Our approach is validated by experiments on benchmark datasets covering a wide range of domains.
arXiv Detail & Related papers (2021-03-30T01:02:31Z) - Double-descent curves in neural networks: a new perspective using
Gaussian processes [9.153116600213641]
Double-descent curves in neural networks describe the phenomenon that the generalisation error initially descends with increasing parameters, then grows after reaching an optimal number of parameters.
We use techniques from random matrix theory to characterize the spectral distribution of the empirical feature covariance matrix as a width-dependent perturbation of the spectrum of the neural network Gaussian process kernel.
arXiv Detail & Related papers (2021-02-14T20:31:49Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks whose width is quadratic in the sample size and linear in the depth, within time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - Generalization bound of globally optimal non-convex neural network
training: Transportation map estimation by infinite dimensional Langevin
dynamics [50.83356836818667]
We introduce a new theoretical framework to analyze deep learning optimization with connection to its generalization error.
Existing frameworks such as mean field theory and neural tangent kernel theory for neural network optimization analysis typically require taking the limit of infinite width of the network to show its global convergence.
arXiv Detail & Related papers (2020-07-11T18:19:50Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
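As a companion to the unit-Lp-sphere entry above, here is a minimal sketch of the constraint mechanism, not the paper's actual training procedure: after every gradient step the weight vector is rescaled back onto {w : ||w||_p = 1} by a simple radial projection. The logistic model, the toy data, and the choice p = 0.8 are assumptions made for illustration; smaller p is expected to favor sparser weights.

```python
import numpy as np

def to_unit_lp_sphere(w, p):
    # Radial rescaling onto {w : ||w||_p = 1}; a simple stand-in for the
    # paper's constrained update, not its exact procedure.
    return w / np.sum(np.abs(w) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 20))
y = (X[:, 0] - 2.0 * X[:, 1] > 0).astype(float)      # toy binary target

p, lr = 0.8, 0.1
w = to_unit_lp_sphere(rng.normal(size=20), p)
for _ in range(500):
    prob = 1.0 / (1.0 + np.exp(-(X @ w)))            # logistic model
    grad = X.T @ (prob - y) / len(y)                 # gradient of the logistic loss
    w = to_unit_lp_sphere(w - lr * grad, p)          # step, then back onto the sphere

print("||w||_p =", np.sum(np.abs(w) ** p) ** (1.0 / p))
print("near-zero weights:", int(np.sum(np.abs(w) < 1e-2)))
```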
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.