On the optimization and generalization of overparameterized implicit
neural networks
- URL: http://arxiv.org/abs/2209.15562v1
- Date: Fri, 30 Sep 2022 16:19:46 GMT
- Title: On the optimization and generalization of overparameterized implicit
neural networks
- Authors: Tianxiang Gao, Hongyang Gao
- Abstract summary: Implicit neural networks have become increasingly attractive in the machine learning community.
We show that global convergence is guaranteed, even if only the implicit layer is trained.
This paper investigates the generalization error for implicit neural networks.
- Score: 25.237054775800164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Implicit neural networks have become increasingly attractive in the machine
learning community since they can achieve competitive performance while using far
fewer computational resources. Recently, a line of theoretical works established
global convergence for first-order methods such as gradient descent when the
implicit networks are over-parameterized. However, as these works train all layers
together, their analyses are equivalent to only studying the evolution of the
output layer. It is unclear how the implicit layer contributes to the training.
Thus, in this paper, we restrict ourselves to only training the implicit layer.
We show that global convergence is guaranteed, even if only the implicit layer
is trained. On the other hand, the theoretical understanding of when and how
the training performance of an implicit neural network generalizes to unseen
data is still under-explored. Although this problem has been studied in
standard feed-forward networks, the case of implicit neural networks is still
intriguing since implicit networks theoretically have infinitely many layers.
Therefore, this paper investigates the generalization error for implicit neural
networks. Specifically, we study the generalization of an implicit network
activated by the ReLU function over random initialization. We provide an
initialization-sensitive generalization bound. As a result, we show that
gradient flow with proper random initialization can train a sufficiently
over-parameterized implicit network to achieve an arbitrarily small
generalization error.
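For intuition, here is a minimal sketch of the setting the abstract describes, assuming the common implicit-layer formulation z* = ReLU(A z* + B x) with a linear readout C z*: the fixed point is approximated by simple iteration, and gradient descent (a discrete-time stand-in for gradient flow) updates only the implicit-layer weight A while B and C stay at their random initialization. The names and scalings used here (ImplicitLayer, width, n_iters) are illustrative assumptions, not the paper's exact construction.

```python
import torch

class ImplicitLayer(torch.nn.Module):
    def __init__(self, d_in, width, d_out, n_iters=50):
        super().__init__()
        m = width
        # Implicit-layer weight A: the only parameter trained below.
        # Scaled so z -> relu(Az + Bx) is a contraction and the fixed point is well defined.
        self.A = torch.nn.Parameter(torch.randn(m, m) / m)
        # Input and output maps B, C are frozen, mirroring "only the implicit layer is trained".
        self.B = torch.nn.Parameter(torch.randn(m, d_in) / m ** 0.5, requires_grad=False)
        self.C = torch.nn.Parameter(torch.randn(d_out, m) / m ** 0.5, requires_grad=False)
        self.n_iters = n_iters

    def forward(self, x):
        # Approximate the fixed point z* = relu(A z* + B x) by unrolled fixed-point
        # iteration, so autograd can differentiate through the solve.
        z = torch.zeros(x.shape[0], self.A.shape[0])
        for _ in range(self.n_iters):
            z = torch.relu(z @ self.A.T + x @ self.B.T)
        return z @ self.C.T

# Tiny usage example: gradient descent on a squared loss, updating only A.
torch.manual_seed(0)
net = ImplicitLayer(d_in=5, width=128, d_out=1)
x, y = torch.randn(32, 5), torch.randn(32, 1)
opt = torch.optim.SGD([net.A], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = 0.5 * ((net(x) - y) ** 2).mean()
    loss.backward()
    opt.step()
```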
Related papers
- Theoretical Characterization of How Neural Network Pruning Affects its
Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z)
- Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z)
- Neural networks trained with SGD learn distributions of increasing
complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics, and exploit higher-order statistics only later during training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z)
- How does unlabeled data improve generalization in self-training? A
one-hidden-layer theoretical analysis [93.37576644429578]
This work establishes the first theoretical analysis for the known iterative self-training paradigm.
We prove the benefits of unlabeled data in both training convergence and generalization ability.
Experiments ranging from shallow to deep neural networks are also provided to justify the correctness of our theoretical insights on self-training.
arXiv Detail & Related papers (2022-01-21T02:16:52Z)
- Critical Initialization of Wide and Deep Neural Networks through Partial
Jacobians: General Theory and Applications [6.579523168465526]
We introduce partial Jacobians of a network, defined as derivatives of preactivations in layer $l$ with respect to preactivations in layer $l_0 \leq l$ (a minimal sketch appears after this list).
We derive recurrence relations for the norms of partial Jacobians and utilize these relations to analyze criticality of deep fully connected neural networks with LayerNorm and/or residual connections.
arXiv Detail & Related papers (2021-11-23T20:31:42Z)
- Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity
on Pruned Neural Networks [79.74580058178594]
We analyze the performance of training a pruned neural network by analyzing the geometric structure of the objective function.
We show that the convex region near a desirable model with guaranteed generalization enlarges as the neural network model is pruned.
arXiv Detail & Related papers (2021-10-12T01:11:07Z)
- A global convergence theory for deep ReLU implicit networks via
over-parameterization [26.19122384935622]
Implicit deep learning has received increasing attention recently.
This paper analyzes the gradient flow of Rectified Linear Unit (ReLU) activated implicit neural networks.
arXiv Detail & Related papers (2021-10-11T23:22:50Z)
- Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z)
- Compressive Sensing and Neural Networks from a Statistical Learning
Perspective [4.561032960211816]
We present a generalization error analysis for a class of neural networks suitable for sparse reconstruction from few linear measurements.
Under realistic conditions, the generalization error scales only logarithmically in the number of layers, and at most linearly in the number of measurements.
arXiv Detail & Related papers (2020-10-29T15:05:43Z)
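As referenced in the Critical Initialization entry above, here is a minimal sketch of a partial Jacobian: the Jacobian of the preactivations at layer $l$ with respect to the preactivations at an earlier layer $l_0 \leq l$, whose norm is tracked across depth. The fully connected ReLU stack, widths, and autograd-based computation below are illustrative assumptions, not that paper's exact recurrence-based analysis.

```python
import torch

def preactivations(layers, x):
    # Run x through a list of (linear, activation) pairs, collecting every preactivation.
    pre, h = [], x
    for linear, act in layers:
        h = linear(h)   # preactivation of this layer
        pre.append(h)
        h = act(h)
    return pre

def partial_jacobian_norm(layers, x, l0, l):
    # Frobenius norm of d(preact_l)/d(preact_{l0}), obtained by replaying layers l0+1..l.
    pre = preactivations(layers, x)
    h_l0 = pre[l0].detach().requires_grad_(True)

    def forward_from_l0(h):
        out = h
        for linear, act in layers[l0 + 1 : l + 1]:
            out = linear(act(out))
        return out

    J = torch.autograd.functional.jacobian(forward_from_l0, h_l0)
    return J.norm()

# Usage on a small fully connected ReLU stack.
torch.manual_seed(0)
width, depth = 64, 6
layers = [(torch.nn.Linear(width, width), torch.nn.ReLU()) for _ in range(depth)]
x = torch.randn(width)
print(partial_jacobian_norm(layers, x, l0=1, l=4))
```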
This list is automatically generated from the titles and abstracts of the papers in this site.