Related papers: Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models under Zipf's Law

Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models under Zipf's Law

URL: http://arxiv.org/abs/2505.19227v1
Date: Sun, 25 May 2025 16:43:51 GMT
Title: Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models under Zipf's Law
Authors: Frederik Kunstner, Francis Bach,
Abstract summary: Recent works have highlighted difficulties faced by gradient descent in training the first and last layers of transformer-based language models.<n>These works suggest that the difficulty is linked to the heavy-tailed distribution of words in text data.<n>We show that the problem is more difficult when the data have heavier tails.
Score: 4.6193503399184275
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent works have highlighted optimization difficulties faced by gradient descent in training the first and last layers of transformer-based language models, which are overcome by optimizers such as Adam. These works suggest that the difficulty is linked to the heavy-tailed distribution of words in text data, where the frequency of the $k$th most frequent word $\pi_k$ is proportional to $1/k$, following Zipf's law. To better understand the impact of the data distribution on training performance, we study a linear bigram model for next-token prediction when the tokens follow a power law $\pi_k \propto 1/k^\alpha$ parameterized by the exponent $\alpha > 0$. We derive optimization scaling laws for deterministic gradient descent and sign descent as a proxy for Adam as a function of the exponent $\alpha$. Existing theoretical investigations in scaling laws assume that the eigenvalues of the data decay as a power law with exponent $\alpha > 1$. This assumption effectively makes the problem ``finite dimensional'' as most of the loss comes from a few of the largest eigencomponents. In comparison, we show that the problem is more difficult when the data have heavier tails. The case $\alpha = 1$ as found in text data is ``worst-case'' for gradient descent, in that the number of iterations required to reach a small relative error scales almost linearly with dimension. While the performance of sign descent also depends on the dimension, for Zipf-distributed data the number of iterations scales only with the square-root of the dimension, leading to a large improvement for large vocabularies.

Related papers

Tight Bounds for Logistic Regression with Large Stepsize Gradient Descent in Low Dimension [36.3266119975955]
We consider the optimization problem of minimizing the logistic gradient with descent to train a linear model for binary classification with separable data.<n>We show that GD with a sufficiently large learning rate $$ finds a point with loss smaller than $mathcalO (1/(T))$, as long as $T geq (n/+ 1/2)$, where $n$ is the dataset size.
arXiv Detail & Related papers (2026-02-12T22:58:18Z)
LoRIF: Low-Rank Influence Functions for Scalable Training Data Attribution [62.830878652285406]
Training data attribution identifies which training examples most influenced a model's prediction.<n>LoRIF exploits low-rank structures of gradient to address both bottlenecks.<n>On models from 0.1B to 70B parameters trained on datasets with millions of examples, LoRIF achieves up to 20$times$ storage reduction and query-time speedup.
arXiv Detail & Related papers (2026-01-29T16:18:34Z)
On the Entropy Calibration of Language Models [52.47557449370603]
We study the problem of entropy calibration, which asks whether a language model's entropy over generations matches its log loss on human text.<n>We find that the observed scaling behavior is similar to what is predicted by the simplified setting.<n>We prove that it is possible, if we assume access to a black box which can fit models to predict the future entropy of text.
arXiv Detail & Related papers (2025-11-15T00:33:03Z)
A universal compression theory: Lottery ticket hypothesis and superpolynomial scaling laws [16.542320113452423]
We show that a neural network can be compressed to polylogarithmic width while preserving its learning dynamics.<n>We also show that a large dataset can be compressed to polylogarithmic size while leaving the loss landscape of the corresponding model unchanged.
arXiv Detail & Related papers (2025-10-01T04:35:23Z)
Improved Scaling Laws in Linear Regression via Data Reuse [30.68341507745991]
We show that data reuse can improve existing scaling laws in linear regression.<n>This suggests an improved scaling law via data reuse (i.e., choosing $L>N$) in data-constrained regimes.
arXiv Detail & Related papers (2025-06-10T03:39:29Z)
Attention with Trained Embeddings Provably Selects Important Tokens [73.77633297039097]
Token embeddings play a crucial role in language modeling but, despite this practical relevance, their theoretical understanding remains limited.<n>Our paper addresses the gap by characterizing the structure of embeddings obtained via gradient descent.<n>Experiments on real-world datasets (IMDB, Yelp) exhibit a phenomenology close to that unveiled by our theory.
arXiv Detail & Related papers (2025-05-22T21:00:09Z)
Nearly Optimal Differentially Private ReLU Regression [18.599299269974498]
We investigate one of the most fundamental non learning problems, ReLU regression, in the Differential Privacy (DP) model.<n>We show that it is possible to achieve an upper bound of $TildeO(fracd2N2 varepsilon2N2 varepsilon2N2 varepsilon2N2 varepsilon2N2 varepsilon2N2 varepsilon2N2 vareps
arXiv Detail & Related papers (2025-03-08T02:09:47Z)
Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification [50.717692060500696]
Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling.<n>Next-token prediction can be made robust so as to achieve $C=tilde O(H)$, representing moderate error amplification.<n>No computationally efficient algorithm can achieve sub-polynomial approximation factor $C=e(log H)1-Omega(1)$.
arXiv Detail & Related papers (2025-02-18T02:52:00Z)
Solving Empirical Bayes via Transformers [18.654470796004265]
This work applies modern AI tools (transformers) to solving one of the oldest statistical problems: Poisson means under empirical Bayes (Poisson-EB) setting.<n>A transformer model is pre-trained on a set of synthetically generated pairs $(X,theta)$ and learns to do in-context learning (ICL) by adapting to unknown $pi$.
arXiv Detail & Related papers (2025-02-14T01:06:15Z)
Scaling Laws in Linear Regression: Compute, Parameters, and Data [86.48154162485712]
We study the theory of scaling laws in an infinite dimensional linear regression setup. We show that the reducible part of the test error is $Theta(-(a-1) + N-(a-1)/a)$. Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
arXiv Detail & Related papers (2024-06-12T17:53:29Z)
Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs. However, there remain gaps between current studies and how language models are trained. In contrast, scaling laws mostly predict loss on inference, but models are usually compared on downstream task performance.
arXiv Detail & Related papers (2024-03-13T13:54:00Z)
Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency [47.8739414267201]
We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data. We show that GD exits this initial oscillatory phase rapidly -- in $mathcalO(eta)$ steps -- and subsequently achieves an $tildemathcalO (1 / (eta t) )$ convergence rate. Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $tildemathcalO (1/T2)$ with an aggressive stepsize
arXiv Detail & Related papers (2024-02-24T23:10:28Z)
A Law of Robustness beyond Isoperimetry [84.33752026418045]
We prove a Lipschitzness lower bound $Omega(sqrtn/p)$ of robustness of interpolating neural network parameters on arbitrary distributions. We then show the potential benefit of overparametrization for smooth data when $n=mathrmpoly(d)$. We disprove the potential existence of an $O(1)$-Lipschitz robust interpolating function when $n=exp(omega(d))$.
arXiv Detail & Related papers (2022-02-23T16:10:23Z)
Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain an $mathcalObig (1/sttnbig)$ which is at the order of excess error on a learning model. For loss factors, we prove that HSDMPG can attain an $mathcalObig (1/sttnbig)$ which is at the order of excess error on a learning model.
arXiv Detail & Related papers (2020-09-18T02:18:44Z)
A Neural Scaling Law from the Dimension of the Data Manifold [8.656787568717252]
When data is plentiful, the loss achieved by well-trained neural networks scales as a power-law $L propto N-alpha$ in the number of network parameters $N$. The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension $d$. This simple theory predicts that the scaling exponents $alpha approx 4/d$ for cross-entropy and mean-squared error losses.
arXiv Detail & Related papers (2020-04-22T19:16:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.