Expected Gradients of Maxout Networks and Consequences to Parameter
Initialization
- URL: http://arxiv.org/abs/2301.06956v2
- Date: Thu, 18 May 2023 15:08:17 GMT
- Title: Expected Gradients of Maxout Networks and Consequences to Parameter
Initialization
- Authors: Hanna Tseran, Guido Montúfar
- Abstract summary: We study the gradients of a maxout network with respect to inputs and parameters and obtain bounds for the moments depending on the architecture and the parameter distribution.
Experiments with deep fully-connected and convolutional networks show that the proposed initialization strategy improves SGD and Adam training of deep maxout networks.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the gradients of a maxout network with respect to inputs and
parameters and obtain bounds for the moments depending on the architecture and
the parameter distribution. We observe that the distribution of the
input-output Jacobian depends on the input, which complicates a stable
parameter initialization. Based on the moments of the gradients, we formulate
parameter initialization strategies that avoid vanishing and exploding
gradients in wide networks. Experiments with deep fully-connected and
convolutional networks show that this strategy improves SGD and Adam training
of deep maxout networks. In addition, we obtain refined bounds on the expected
number of linear regions, results on the expected curve length distortion, and
results on the NTK.
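
The initialization strategy described in the abstract scales the weight variance with the layer fan-in, using a constant that depends on the maxout rank and the gradient moments derived in the paper. Below is a minimal NumPy sketch of such a fan-in-scaled Gaussian initialization for a stack of fully-connected maxout layers; the rank K = 5 and the constant `C_PLACEHOLDER` are illustrative assumptions, not the paper's derived values.

```python
import numpy as np

# Sketch of a maxout layer with weights drawn from N(0, c / fan_in).
# The paper derives a rank-dependent constant c from the gradient moments;
# the value below is a placeholder assumption for illustration only.
MAXOUT_RANK = 5        # number of linear pieces K per maxout unit (assumed)
C_PLACEHOLDER = 0.55   # stand-in for the rank-dependent constant


def init_maxout_layer(fan_in: int, fan_out: int, rng: np.random.Generator):
    """Return weights of shape (K, fan_out, fan_in) and zero biases."""
    std = np.sqrt(C_PLACEHOLDER / fan_in)
    W = rng.normal(0.0, std, size=(MAXOUT_RANK, fan_out, fan_in))
    b = np.zeros((MAXOUT_RANK, fan_out))
    return W, b


def maxout_forward(x, W, b):
    """Map (batch, fan_in) -> (batch, fan_out) by taking the max over the K pre-activations."""
    pre = np.einsum("kof,bf->bko", W, x) + b  # (batch, K, fan_out)
    return pre.max(axis=1)


# Usage: stack several layers and inspect how the activation scale evolves with depth.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 128))
for _ in range(10):
    W, b = init_maxout_layer(x.shape[1], 128, rng)
    x = maxout_forward(x, W, b)
print(float(np.mean(x ** 2)))  # mean squared activation after 10 layers
```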
Related papers
- A Local Polyak-Lojasiewicz and Descent Lemma of Gradient Descent For Overparametrized Linear Models [6.734175048463699]
We derive a linear convergence rate for gradient descent for two-layer linear neural networks trained with squared loss.
Our convergence analysis not only improves upon prior results but also suggests a better choice for the step size.
arXiv Detail & Related papers (2025-05-16T19:57:22Z)
- Local Loss Optimization in the Infinite Width: Stable Parameterization of Predictive Coding Networks and Target Propagation [8.35644084613785]
We introduce the maximal update parameterization ($\mu$P) in the infinite-width limit for two representative designs of local targets.
By analyzing deep linear networks, we found that PC's gradients interpolate between first-order and Gauss-Newton-like gradients.
We demonstrate that, in specific standard settings, PC in the infinite-width limit behaves more similarly to the first-order gradient.
arXiv Detail & Related papers (2024-11-04T11:38:27Z)
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z)
- Adaptive Multilevel Neural Networks for Parametric PDEs with Error Estimation [0.0]
A neural network architecture is presented to solve high-dimensional parameter-dependent partial differential equations (pPDEs).
It is constructed to map parameters of the model data to corresponding finite element solutions.
It outputs a coarse grid solution and a series of corrections as produced in an adaptive finite element method (AFEM).
arXiv Detail & Related papers (2024-03-19T11:34:40Z)
- On the Impact of Overparameterization on the Training of a Shallow Neural Network in High Dimensions [0.0]
We study the training dynamics of a shallow neural network with quadratic activation functions and quadratic cost.
In line with previous works on the same neural architecture, the optimization is performed following the gradient flow on the population risk.
arXiv Detail & Related papers (2023-11-07T08:20:31Z)
- Optimization dependent generalization bound for ReLU networks based on sensitivity in the tangent bundle [0.0]
We propose a PAC type bound on the generalization error of feedforward ReLU networks.
The obtained bound does not explicitly depend on the depth of the network.
arXiv Detail & Related papers (2023-10-26T13:14:13Z)
- Optimization Guarantees of Unfolded ISTA and ADMM Networks With Smooth Soft-Thresholding [57.71603937699949]
We study optimization guarantees, i.e., achieving near-zero training loss with the increase in the number of learning epochs.
We show that the threshold on the number of training samples increases with the increase in the network width.
arXiv Detail & Related papers (2023-09-12T13:03:47Z)
- Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Generalization Error Bounds for Deep Neural Networks Trained by SGD [3.148524502470734]
Generalization error bounds for deep neural networks trained by stochastic gradient descent (SGD) are derived.
The bounds explicitly depend on the loss along the training trajectory.
Results show that our bounds are non-vacuous and robust with respect to changes of the neural network and the network hyperparameters.
arXiv Detail & Related papers (2022-06-07T13:46:10Z)
- On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z)
- MSE-Optimal Neural Network Initialization via Layer Fusion [68.72356718879428]
Deep neural networks achieve state-of-the-art performance for a range of classification and inference tasks.
The use of gradient-based methods combined with nonconvexity renders learning susceptible to a number of problems.
We propose fusing neighboring layers of deeper networks that are trained with random initialization.
arXiv Detail & Related papers (2020-01-28T18:25:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.