Understanding Multi-phase Optimization Dynamics and Rich Nonlinear
Behaviors of ReLU Networks
- URL: http://arxiv.org/abs/2305.12467v5
- Date: Wed, 27 Dec 2023 12:37:18 GMT
- Title: Understanding Multi-phase Optimization Dynamics and Rich Nonlinear
Behaviors of ReLU Networks
- Authors: Mingze Wang, Chao Ma
- Abstract summary: We conduct a theoretical characterization of the training process of a two-layer ReLU network trained by Gradient Flow on linearly separable data.
We reveal four different phases from the whole training process showing a general simplifying-to-complicating learning trend.
Specific nonlinear behaviors can also be precisely identified and captured theoretically, such as initial condensation, saddle-to-plateau dynamics, plateau escape, changes of activation patterns, and learning with increasing complexity.
- Score: 8.180184504355571
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The training process of ReLU neural networks often exhibits complicated
nonlinear phenomena. The nonlinearity of models and non-convexity of loss pose
significant challenges for theoretical analysis. Therefore, most previous
theoretical works on the optimization dynamics of neural networks focus either
on local analysis (like the end of training) or approximate linear models (like
Neural Tangent Kernel). In this work, we conduct a complete theoretical
characterization of the training process of a two-layer ReLU network trained by
Gradient Flow on linearly separable data. In this specific setting, our
analysis captures the whole optimization process starting from random
initialization to final convergence. Despite the relatively simple model and
data that we studied, we reveal four different phases from the whole training
process showing a general simplifying-to-complicating learning trend. Specific
nonlinear behaviors can also be precisely identified and captured
theoretically, such as initial condensation, saddle-to-plateau dynamics,
plateau escape, changes of activation patterns, learning with increasing
complexity, etc.
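The setting described above is simple enough to reproduce numerically. Below is a minimal NumPy sketch, not the authors' code: it trains a two-layer ReLU network with small-step gradient descent as a stand-in for Gradient Flow on synthetic linearly separable data. The width, initialization scale, step size, and logistic loss are illustrative assumptions; logging the loss and the number of active (sample, neuron) pairs gives only a crude view of plateaus and of changes in activation patterns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linearly separable data: labels are the sign of <w*, x>.
n, d = 200, 5
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)

# Two-layer ReLU network f(x) = sum_k a_k * relu(<b_k, x>), small random init.
m = 50                                   # hidden width (illustrative)
scale = 1e-3                             # small initialization
B = scale * rng.normal(size=(m, d))      # input-layer weights
a = scale * rng.normal(size=m)           # output-layer weights

def forward(X, B, a):
    pre = X @ B.T                        # (n, m) pre-activations
    act = np.maximum(pre, 0.0)           # ReLU
    return act @ pre_to_out(a, act), pre, act

def pre_to_out(a, act):
    return a

lr, steps = 0.02, 20000                  # small step size approximates gradient flow
for t in range(steps + 1):
    f, pre, act = forward(X, B, a)
    margins = y * f
    # Logistic loss on margins: L = mean(log(1 + exp(-y f)))
    loss = np.mean(np.logaddexp(0.0, -margins))
    # dL/df = -y * sigmoid(-y f) / n, written with tanh for numerical stability
    g = -y * 0.5 * (1.0 - np.tanh(margins / 2.0)) / n
    grad_a = act.T @ g                              # (m,)
    grad_B = ((g[:, None] * (pre > 0)) * a).T @ X   # (m, d)
    a -= lr * grad_a
    B -= lr * grad_B
    if t % 2000 == 0:
        # Activation pattern: which neurons are active on which samples.
        n_active = int((pre > 0).sum())
        print(f"step {t:6d}  loss {loss:.4f}  active (sample, neuron) pairs {n_active}")
```

With a small initialization and step size, the loss curve typically sits on an early plateau near log 2 before dropping, and the count of active (sample, neuron) pairs changes over training, loosely mirroring the simplifying-to-complicating trend described in the abstract.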
Related papers
- Absence of Closed-Form Descriptions for Gradient Flow in Two-Layer Narrow Networks [0.8158530638728501]
We show that the dynamics of the gradient flow in two-layer narrow networks is not an integrable system.
Under mild conditions, the identity component of the differential Galois group of the variational equations of the gradient flow is non-solvable.
This result confirms the system's non-integrability and implies that the training dynamics cannot be represented by Liouvillian functions.
arXiv Detail & Related papers (2024-08-15T17:40:11Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Theoretical Characterization of the Generalization Performance of
Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z) - Physics Informed Piecewise Linear Neural Networks for Process
Optimization [0.0]
We propose to upgrade piecewise linear neural network models with physics-informed knowledge for optimization problems with embedded neural network models.
In all cases, the optimal results based on physics-informed trained neural networks are closer to global optimality.
arXiv Detail & Related papers (2023-02-02T10:14:54Z) - Subquadratic Overparameterization for Shallow Neural Networks [60.721751363271146]
We provide an analytical framework that allows us to adopt standard neural training strategies.
We achieve the desiderata via the Polyak-Lojasiewicz condition, smoothness, and standard assumptions.
arXiv Detail & Related papers (2021-11-02T20:24:01Z) - A purely data-driven framework for prediction, optimization, and control
of networked processes: application to networked SIS epidemic model [0.8287206589886881]
We develop a data-driven framework based on operator-theoretic techniques to identify and control nonlinear dynamics over large-scale networks.
The proposed approach requires no prior knowledge of the network structure and identifies the underlying dynamics solely using a collection of two-step snapshots of the states (a generic sketch of snapshot-based identification appears after this list).
arXiv Detail & Related papers (2021-08-01T03:57:10Z) - Edge of chaos as a guiding principle for modern neural network training [19.419382003562976]
We study the role of various hyperparameters in modern neural network training algorithms in terms of the order-chaos phase diagram.
In particular, we study a fully analytical feedforward neural network trained on the widely adopted Fashion-MNIST dataset.
arXiv Detail & Related papers (2021-07-20T12:17:55Z) - Learning Fast Approximations of Sparse Nonlinear Regression [50.00693981886832]
In this work, we bridge the gap by introducing the Nonlinear Learned Iterative Shrinkage Thresholding Algorithm (NLISTA).
Experiments on synthetic data corroborate our theoretical results and show our method outperforms state-of-the-art methods.
arXiv Detail & Related papers (2020-10-26T11:31:08Z) - DynNet: Physics-based neural architecture design for linear and
nonlinear structural response modeling and prediction [2.572404739180802]
In this study, a physics-based recurrent neural network model is designed that is able to learn the dynamics of linear and nonlinear multiple degrees of freedom systems.
The model is able to estimate a complete set of responses, including displacement, velocity, acceleration, and internal forces.
arXiv Detail & Related papers (2020-07-03T17:05:35Z) - Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the precise role of the stochasticity in its success is still unclear.
We show that heavy tails commonly arise in the parameters as a consequence of multiplicative noise due to gradient variance.
A detailed analysis is conducted in which we describe how key factors, including step size and data properties, influence this behavior, with similar results observed for state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z) - Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
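For the networked-processes paper above, which identifies dynamics "solely using a collection of two-step snapshots of the states", the sketch below shows a generic operator-theoretic (DMD-style) least-squares fit from snapshot pairs. It is an assumed, simplified stand-in rather than the authors' algorithm; the function name and the toy linear system are illustrative.

```python
import numpy as np

def identify_dynamics(x_prev: np.ndarray, x_next: np.ndarray) -> np.ndarray:
    """Least-squares fit of A in x_{t+1} ~= A x_t from two-step snapshots.

    x_prev, x_next: arrays of shape (n_states, n_pairs); column i holds a
    state and the state observed one step later.
    """
    return x_next @ np.linalg.pinv(x_prev)

# Toy usage: recover a random linear map from snapshot pairs.
rng = np.random.default_rng(1)
n_states, n_pairs = 6, 200
A_true = rng.normal(size=(n_states, n_states)) / np.sqrt(n_states)
x_prev = rng.normal(size=(n_states, n_pairs))
x_next = A_true @ x_prev
A_hat = identify_dynamics(x_prev, x_next)
print("operator recovery error:", np.linalg.norm(A_hat - A_true))
```

Genuinely nonlinear networked systems would typically require lifting the states through a dictionary of observables before such a fit, which is where the operator-theoretic (Koopman-style) machinery enters.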
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides (including all information above) and is not responsible for any consequences arising from its use.