Related papers: Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking

Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking

URL: http://arxiv.org/abs/2510.04930v1
Date: Mon, 06 Oct 2025 15:40:36 GMT
Title: Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking
Authors: Ali Saheb Pasand, Elvis Dohmatob,
Abstract summary: We show that grokking can be induced by asymmetric speeds of gradient descent.<n>We then propose a simple modification that normalizes the gradients so that dynamics along all the principal directions evolves at exactly the same speed.<n>We establish that this modified method, which we call egalitarian gradient descent (EGD), can be seen as a carefully modified form of natural gradient descent, groks much faster.
Score: 16.00639294713734
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Grokking is the phenomenon whereby, unlike the training performance, which peaks early in the training process, the test/generalization performance of a model stagnates over arbitrarily many epochs and then suddenly jumps to usually close to perfect levels. In practice, it is desirable to reduce the length of such plateaus, that is to make the learning process "grok" faster. In this work, we provide new insights into grokking. First, we show both empirically and theoretically that grokking can be induced by asymmetric speeds of (stochastic) gradient descent, along different principal (i.e singular directions) of the gradients. We then propose a simple modification that normalizes the gradients so that dynamics along all the principal directions evolves at exactly the same speed. Then, we establish that this modified method, which we call egalitarian gradient descent (EGD) and can be seen as a carefully modified form of natural gradient descent, groks much faster. In fact, in some cases the stagnation is completely removed. Finally, we empirically show that on classical arithmetic problems such as modular addition and sparse parity problem which this stagnation has been widely observed and intensively studied, that our proposed method eliminates the plateaus.

Related papers

One-Step Forward and Backtrack: Overcoming Zig-Zagging in Loss-Aware Quantization Training [12.400950982075948]
Weight quantization is an effective technique to compress deep neural networks for their deployment on edge devices with limited resources. Traditional loss-aware quantization methods commonly use the quantized gradient to replace the full-precision gradient. This paper proposes a one-step forward and backtrack way for loss-aware quantization to get more accurate and stable gradient direction.
arXiv Detail & Related papers (2024-01-30T05:42:54Z)
Acceleration and Implicit Regularization in Gaussian Phase Retrieval [5.484345596034159]
In this setting, we prove that methods with Polyak or Nesterov momentum implicit regularization ensures a nice convex descent. Experimental evidence shows that these methods converge faster than gradient descent in practice.
arXiv Detail & Related papers (2023-11-21T04:10:03Z)
Why is parameter averaging beneficial in SGD? An objective smoothing perspective [13.863368438870562]
gradient descent (SGD) and its implicit bias are often characterized in terms of the sharpness of the minima. We study the commonly-used averaged SGD algorithm, which has been empirically observed in Izmailov et al. We prove that averaged SGD can efficiently optimize the smoothed objective which avoids sharp local minima.
arXiv Detail & Related papers (2023-02-18T16:29:06Z)
Nesterov acceleration despite very noisy gradients [2.048226951354646]
We present a generalization of Nesterov's accelerated gradient descent algorithm. AGNES achieves acceleration for smooth convex and strongly convex minimization tasks.
arXiv Detail & Related papers (2023-02-10T21:32:47Z)
Grad-GradaGrad? A Non-Monotone Adaptive Stochastic Gradient Method [17.275654092947647]
We introduce GradaGrad, a method in the same family that naturally grows or shrinks the learning rate based on a different accumulation in the denominator. We show that it obeys a similar convergence rate as AdaGrad and demonstrate its non-monotone adaptation capability with experiments.
arXiv Detail & Related papers (2022-06-14T14:55:27Z)
On Training Implicit Models [75.20173180996501]
We propose a novel gradient estimate for implicit models, named phantom gradient, that forgoes the costly computation of the exact gradient. Experiments on large-scale tasks demonstrate that these lightweight phantom gradients significantly accelerate the backward passes in training implicit models by roughly 1.7 times.
arXiv Detail & Related papers (2021-11-09T14:40:24Z)
Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization [89.66571637204012]
textscAdaMomentum on vision, and achieves state-the-art results consistently on other tasks including language processing. textscAdaMomentum on vision, and achieves state-the-art results consistently on other tasks including language processing. textscAdaMomentum on vision, and achieves state-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z)
Decreasing scaling transition from adaptive gradient descent to stochastic gradient descent [1.7874193862154875]
We propose a decreasing scaling transition from adaptive gradient descent to gradient descent method DSTAda. Our experimental results show that DSTAda has a faster speed, higher accuracy, and better stability and robustness.
arXiv Detail & Related papers (2021-06-12T11:28:58Z)
Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to emphNon-uniform Smoothness (NS) and emphNon-uniform Lojasiewicz inequality (NL) New definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $Omega (1/t2)$ lower bounds.
arXiv Detail & Related papers (2021-05-13T04:23:07Z)
A Study of Gradient Variance in Deep Learning [56.437755740715396]
We introduce a method, Gradient Clustering, to minimize the variance of average mini-batch gradient with stratified sampling. We measure the gradient variance on common deep learning benchmarks and observe that, contrary to common assumptions, gradient variance increases during training.
arXiv Detail & Related papers (2020-07-09T03:23:10Z)
Regularizing Meta-Learning via Gradient Dropout [102.29924160341572]
meta-learning models are prone to overfitting when there are no sufficient training tasks for the meta-learners to generalize. We introduce a simple yet effective method to alleviate the risk of overfitting for gradient-based meta-learning.
arXiv Detail & Related papers (2020-04-13T10:47:02Z)
Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks. In this paper we analyze a variant of OptimisticOA algorithm for nonconcave minmax problems. Our experiments show that adaptive GAN non-adaptive gradient algorithms can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.