Imitating Deep Learning Dynamics via Locally Elastic Stochastic
Differential Equations
- URL: http://arxiv.org/abs/2110.05960v1
- Date: Mon, 11 Oct 2021 17:17:20 GMT
- Title: Imitating Deep Learning Dynamics via Locally Elastic Stochastic
Differential Equations
- Authors: Jiayao Zhang, Hua Wang, Weijie J. Su
- Abstract summary: We study the evolution of features during deep learning training using a set of stochastic differential equations (SDEs) that each corresponds to a training sample.
Our results shed light on the decisive role of local elasticity in the training dynamics of neural networks.
- Score: 20.066631203802302
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Understanding the training dynamics of deep learning models is perhaps a
necessary step toward demystifying the effectiveness of these models. In
particular, how do data from different classes gradually become separable in
their feature spaces when training neural networks using stochastic gradient
descent? In this study, we model the evolution of features during deep learning
training using a set of stochastic differential equations (SDEs) that each
corresponds to a training sample. As a crucial ingredient in our modeling
strategy, each SDE contains a drift term that reflects the impact of
backpropagation at an input on the features of all samples. Our main finding
uncovers a sharp phase transition phenomenon regarding the intra-class impact:
if the SDEs are locally elastic in the sense that the impact is more
significant on samples from the same class as the input, the features of the
training data become linearly separable, meaning vanishing training loss;
otherwise, the features are not separable, regardless of how long the training
time is. Moreover, in the presence of local elasticity, an analysis of our SDEs
shows the emergence of a simple geometric structure called neural
collapse of the features. Taken together, our results shed light on the
decisive role of local elasticity in the training dynamics of neural networks.
We corroborate our theoretical analysis with experiments on a synthesized
dataset of geometric shapes and CIFAR-10.
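To make the modeling strategy concrete, below is a minimal Euler-Maruyama simulation of a simplified two-class system of locally elastic SDEs. The drift form, the parameters `E_intra`, `E_inter`, and `sigma`, and the per-class target directions are illustrative assumptions, not the paper's exact equations; the sketch only mimics the qualitative phase transition described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) setup: two classes, 2-D features, Euler-Maruyama discretization.
n_per_class, d, T, dt = 20, 2, 5.0, 0.01
sigma = 0.1                    # diffusion strength
E_intra, E_inter = 1.0, 0.3    # intra-/inter-class impact; E_intra > E_inter means "locally elastic"

labels = np.repeat([0, 1], n_per_class)                # class label of each sample
mu = np.array([[1.0, 0.0], [-1.0, 0.0]])               # hypothetical per-class target directions
x = rng.normal(scale=0.1, size=(2 * n_per_class, d))   # initial features

for _ in range(int(T / dt)):
    # Impact matrix: an update at sample j affects sample i more strongly if they share a class.
    same = (labels[:, None] == labels[None, :]).astype(float)
    H = E_intra * same + E_inter * (1.0 - same)
    # Drift: each sample's pull toward its class direction, averaged with impact weights.
    pull = mu[labels] - x
    drift = (H @ pull) / len(labels)
    x += drift * dt + sigma * np.sqrt(dt) * rng.normal(size=x.shape)

print("class-0 mean:", x[labels == 0].mean(axis=0))
print("class-1 mean:", x[labels == 1].mean(axis=0))
```

With `E_intra > E_inter` the two class means drift apart and the features become separable; setting `E_intra == E_inter` gives every sample the same drift, so the classes stay mixed, loosely mirroring the separable and non-separable regimes of the phase transition.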
Related papers
- Training Dynamics of Nonlinear Contrastive Learning Model in the High Dimensional Limit [1.7597525104451157]
An empirical distribution of the model weights converges to a deterministic measure governed by a McKean-Vlasov nonlinear partial differential equation (PDE)
Under L2 regularization, this PDE reduces to a closed set of low-dimensional ordinary differential equations (ODEs)
We analyze the fixed-point locations of the ODEs and their stability, unveiling several interesting findings.
arXiv Detail & Related papers (2024-06-11T03:07:41Z) - Automatic Differentiation is Essential in Training Neural Networks for Solving Differential Equations [7.890817997914349]
Neural network-based approaches have recently shown significant promise in solving partial differential equations (PDEs) in science and engineering.
One advantage of neural network methods for PDEs lies in their use of automatic differentiation (AD).
In this paper, we quantitatively demonstrate the advantage of AD in training neural networks.
arXiv Detail & Related papers (2024-05-23T02:01:05Z) - Tipping Points of Evolving Epidemiological Networks: Machine
Learning-Assisted, Data-Driven Effective Modeling [0.0]
We study the tipping point collective dynamics of an adaptive susceptible-infected-susceptible (SIS) epidemiological network in a data-driven, machine learning-assisted manner.
We identify a complex effective stochastic differential equation (eSDE) in terms of physically meaningful coarse mean-field variables.
We study the statistics of rare events both through repeated brute force simulations and by using established mathematical/computational tools.
arXiv Detail & Related papers (2023-11-01T19:33:03Z) - Latent State Models of Training Dynamics [51.88132043461152]
We train models with different random seeds and compute a variety of metrics throughout training.
We then fit a hidden Markov model (HMM) over the resulting sequences of metrics.
We use the HMM representation to study phase transitions and identify latent "detour" states that slow down convergence (a rough sketch of this metric-sequence HMM pipeline appears after this list).
arXiv Detail & Related papers (2023-08-18T13:20:08Z) - Robust Learning with Progressive Data Expansion Against Spurious
Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z) - Capturing Actionable Dynamics with Structured Latent Ordinary
Differential Equations [68.62843292346813]
We propose a structured latent ODE model that captures system input variations within its latent representation.
Building on a static variable specification, our model learns factors of variation for each input to the system, thus separating the effects of the system inputs in the latent space.
arXiv Detail & Related papers (2022-02-25T20:00:56Z) - Stochastic Physics-Informed Neural Networks (SPINN): A Moment-Matching
Framework for Learning Hidden Physics within Stochastic Differential
Equations [4.482886054198202]
We propose a framework for training deep neural networks to learn equations that represent hidden physics within stochastic differential equations (SDEs).
The proposed framework relies on uncertainty propagation and moment-matching techniques along with state-of-the-art deep learning strategies.
arXiv Detail & Related papers (2021-09-03T16:59:12Z) - Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z) - Supervised Learning in the Presence of Concept Drift: A modelling
framework [5.22609266390809]
We present a modelling framework for the investigation of supervised learning in non-stationary environments.
We model two example types of learning systems: prototype-based Learning Vector Quantization (LVQ) for classification and shallow, layered neural networks for regression tasks.
arXiv Detail & Related papers (2020-05-21T09:13:58Z) - Stochasticity in Neural ODEs: An Empirical Study [68.8204255655161]
Regularization of neural networks (e.g. dropout) is a widespread technique in deep learning that allows for better generalization.
We show that data augmentation during training improves the performance of both deterministic and stochastic versions of the same model.
However, the improvements obtained by data augmentation completely eliminate the empirical gains from stochastic regularization, making the performance difference between neural ODEs and neural SDEs negligible.
arXiv Detail & Related papers (2020-02-22T22:12:56Z) - Learning Stochastic Behaviour from Aggregate Data [52.012857267317784]
Learning nonlinear dynamics from aggregate data is a challenging problem because the full trajectory of each individual is not available.
We propose a novel method using the weak form of the Fokker-Planck equation (FPE) to describe the density evolution of data in a sampled form.
In such a sample-based framework we are able to learn the nonlinear dynamics from aggregate data without explicitly solving the partial differential equation (PDE) FPE.
arXiv Detail & Related papers (2020-02-10T03:20:13Z)