Phase Diagram of Dropout for Two-Layer Neural Networks in the Mean-Field Regime
- URL: http://arxiv.org/abs/2510.07554v1
- Date: Wed, 08 Oct 2025 21:09:32 GMT
- Title: Phase Diagram of Dropout for Two-Layer Neural Networks in the Mean-Field Regime
- Authors: Lénaïc Chizat, Pierre Marion, Yerkin Yesbay
- Abstract summary: Dropout is a training technique for neural networks that consists of randomly deactivating units at each step of their gradient-based training. We study the large-width asymptotics of gradient descent with dropout on two-layer neural networks with the mean-field initialization scale.
- Score: 20.21806452403366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dropout is a standard training technique for neural networks that consists of randomly deactivating units at each step of their gradient-based training. It is known to improve performance in many settings, including in the large-scale training of language or vision models. As a first step towards understanding the role of dropout in large neural networks, we study the large-width asymptotics of gradient descent with dropout on two-layer neural networks with the mean-field initialization scale. We obtain a rich asymptotic phase diagram that exhibits five distinct nondegenerate phases depending on the relative magnitudes of the dropout rate, the learning rate, and the width. Notably, we find that the well-studied "penalty" effect of dropout only persists in the limit with impractically small learning rates of order $O(1/\text{width})$. For larger learning rates, this effect disappears and in the limit, dropout is equivalent to a "random geometry" technique, where the gradients are thinned randomly after the forward and backward pass have been computed. In this asymptotic regime, the limit is described by a mean-field jump process where the neurons' update times follow independent Poisson or Bernoulli clocks (depending on whether the learning rate vanishes or not). For some of the phases, we obtain a description of the limit dynamics both in path-space and in distribution-space. The convergence proofs involve a mix of tools from mean-field particle systems and stochastic processes. Together, our results lay the groundwork for a renewed theoretical understanding of dropout in large-scale neural networks.
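As a rough illustration of the objects discussed in the abstract, the NumPy sketch below contrasts standard dropout, where hidden units are deactivated before the forward and backward pass, with the "random geometry" variant described above, where the full gradient is computed first and then thinned so that only a random subset of neurons is updated. The two-layer architecture with $1/\text{width}$ output scaling, the helper names (`forward`, `grads`), and the toy data are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

m, d = 512, 10                       # width and input dimension
W = rng.normal(size=(m, d))          # first-layer weights
a = rng.normal(size=m)               # second-layer weights
lr, keep_p = 0.1, 0.5                # learning rate and keep probability (dropout rate = 1 - keep_p)


def forward(W, a, x, mask=None):
    """Mean-field prediction (1/m) * sum_j mask_j * a_j * relu(w_j . x)."""
    h = np.maximum(W @ x, 0.0)
    if mask is not None:
        h = h * mask / keep_p        # inverted-dropout rescaling
    return float(a @ h) / m


def grads(W, a, x, y, mask=None):
    """Squared-loss gradients for a single sample (x, y), with optional dropout mask."""
    keep = np.ones(m) if mask is None else mask / keep_p
    pre = W @ x
    h = np.maximum(pre, 0.0)
    err = float(a @ (h * keep)) / m - y
    ga = err * h * keep / m                            # d loss / d a
    gW = np.outer(err * a * keep * (pre > 0) / m, x)   # d loss / d W
    return gW, ga


x, y = rng.normal(size=d), 1.0

# (1) Standard dropout: sample a Bernoulli mask, then run the forward/backward pass with it.
mask = rng.binomial(1, keep_p, size=m).astype(float)
gW, ga = grads(W, a, x, y, mask=mask)
W1, a1 = W - lr * gW, a - lr * ga

# (2) "Random geometry" thinning: compute the full (no-dropout) gradient,
# then zero it on a random subset of neurons, so only that subset moves.
gW_full, ga_full = grads(W, a, x, y)
thin = rng.binomial(1, keep_p, size=m).astype(float)
W2, a2 = W - lr * thin[:, None] * gW_full, a - lr * thin * ga_full

print(forward(W1, a1, x), forward(W2, a2, x))
```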
Related papers
- The Butterfly Effect: Neural Network Training Trajectories Are Highly Sensitive to Initial Conditions [51.68215326304272]
We show that even small perturbations reliably cause otherwise identical training trajectories to diverge, an effect that diminishes rapidly over training time. Our findings provide insights into neural network training stability, with practical implications for fine-tuning, model merging, and diversity of model ensembles.
arXiv Detail & Related papers (2025-06-16T08:35:16Z) - Analytic theory of dropout regularization [4.939584372342451]
Dropout is a regularization technique widely used in training artificial neural networks. We analytically study dropout in two-layer neural networks trained with online gradient descent.
arXiv Detail & Related papers (2025-05-12T17:45:02Z) - Implicit regularization of deep residual networks towards neural ODEs [8.075122862553359]
We establish an implicit regularization of deep residual networks towards neural ODEs.
We prove that if the network is initialized as a discretization of a neural ODE, then such a discretization holds throughout training.
arXiv Detail & Related papers (2023-09-03T16:35:59Z) - Speed Limits for Deep Learning [67.69149326107103]
Recent advancement in thermodynamics allows bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network.
We provide analytical expressions for these speed limits for linear and linearizable neural networks.
Remarkably, given some plausible scaling assumptions on the NTK spectrum and the spectral decomposition of the labels, learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z) - Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Dropout Reduces Underfitting [85.61466286688385]
In this study, we demonstrate that dropout can also mitigate underfitting when used at the start of training.
We find dropout reduces the directional variance of gradients across mini-batches and helps align the mini-batch gradients with the entire dataset's gradient.
Our findings lead us to a solution for improving performance in underfitting models, early dropout: dropout is applied only during the initial phases of training and turned off afterwards (a toy sketch of this schedule appears after this list).
arXiv Detail & Related papers (2023-03-02T18:59:15Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning makes significant progress in pre-training large models, but struggles with small models.
We introduce another one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z) - Training Integrable Parameterizations of Deep Neural Networks in the Infinite-Width Limit [0.0]
Large-width dynamics has emerged as a fruitful viewpoint and led to practical insights on real-world deep networks.
For two-layer neural networks, it has been understood that the nature of the trained model radically changes depending on the scale of the initial random weights.
We propose various methods to avoid the trivial training behavior that this parameterization exhibits in the infinite-width limit, and analyze the resulting dynamics in detail.
arXiv Detail & Related papers (2021-10-29T07:53:35Z) - Plateau Phenomenon in Gradient Descent Training of ReLU networks: Explanation, Quantification and Avoidance [0.0]
In general, neural networks are trained by gradient-type optimization methods.
The loss function decreases rapidly at the beginning of training but then, after a relatively small number of steps, significantly slows down.
The present work aims to identify and quantify the root causes of the plateau phenomenon.
arXiv Detail & Related papers (2020-07-14T17:33:26Z)
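The early dropout recipe from "Dropout Reduces Underfitting" above lends itself to a very small illustration. The following NumPy sketch (not the paper's code) trains a toy linear regression with online gradient descent, applying inverted dropout to the input features only for the first `early_steps` updates and running plain gradient steps afterwards; the model, the data, and the cutoff value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, steps, early_steps, lr, keep_p = 20, 2000, 200, 0.05, 0.8

w_true = rng.normal(size=d)          # target regression vector
w = np.zeros(d)                      # model parameters

for t in range(steps):
    x = rng.normal(size=d)
    y = float(w_true @ x)
    if t < early_steps:
        # Early phase: drop input features at random (inverted-dropout rescaling).
        x_in = x * rng.binomial(1, keep_p, size=d) / keep_p
    else:
        # Late phase: dropout switched off, plain online gradient descent.
        x_in = x
    err = float(w @ x_in) - y
    w -= lr * err * x_in             # gradient of 0.5 * (w . x_in - y)^2

print(float(np.linalg.norm(w - w_true)))   # small residual after training
```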