Grokking in Linear Models for Logistic Regression
- URL: http://arxiv.org/abs/2602.08302v1
- Date: Mon, 09 Feb 2026 06:16:43 GMT
- Title: Grokking in Linear Models for Logistic Regression
- Authors: Nataraj Das, Atreya Vedantam, Chandrashekar Lakshminarayanan
- Abstract summary: Grokking, the phenomenon of delayed generalization, is often attributed to the depth and compositional structure of deep neural networks. We study grokking in one of the simplest possible settings: the learning of a linear model with logistic loss for binary classification on data that are linearly (and max margin) separable about the origin.
- Score: 0.9332987715848714
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grokking, the phenomenon of delayed generalization, is often attributed to the depth and compositional structure of deep neural networks. We study grokking in one of the simplest possible settings: the learning of a linear model with logistic loss for binary classification on data that are linearly (and max margin) separable about the origin. We investigate three testing regimes: (1) test data drawn from the same distribution as the training data, in which case grokking is not observed; (2) test data concentrated around the margin, in which case grokking is observed; and (3) adversarial test data generated via projected gradient descent (PGD) attacks, in which case grokking is also observed. We theoretically show that the implicit bias of gradient descent induces a three-phase learning process-population-dominated, support-vector-dominated unlearning, and support-vector-dominated generalization-during which delayed generalization can arise. Our analysis further relates the emergence of grokking to asymmetries in the data, both in the number of examples per class and in the distribution of support vectors across classes, and yields a characterization of the grokking time. We experimentally validate our theory by planting different distributions of population points and support vectors, and by analyzing accuracy curves and hyperplane dynamics. Overall, our results demonstrate that grokking does not require depth or representation learning, and can emerge even in linear models through the dynamics of the bias term.
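The setting is concrete enough to prototype directly. Below is a minimal sketch (not the authors' code) of the setup the abstract describes: full-batch gradient descent on the logistic loss for a linear model with a bias term, trained on separable data with planted population points and asymmetric support vectors, then evaluated on a margin-concentrated test set. All constants (cluster locations, planted support vectors, learning rate, horizon) are illustrative assumptions; whether and when the margin accuracy jumps depends on the planted asymmetries.
```python
# Sketch of the paper's setting (illustrative constants, not the authors' code):
# linear model f(x) = w.x + b trained by gradient descent on logistic loss,
# with the true max-margin boundary at x1 = 0 and planted support vectors.
import numpy as np

rng = np.random.default_rng(0)
d, n_pop = 2, 200

# Population points: well-separated clusters with asymmetric class sizes.
X_pos = rng.normal(loc=[+3.0, 0.0], scale=0.5, size=(n_pop, d))
X_neg = rng.normal(loc=[-3.0, 0.0], scale=0.5, size=(n_pop // 4, d))

# Planted support vectors near the margin, asymmetric across classes
# (an assumption for illustration, echoing the paper's planted distributions).
SV_pos = np.array([[0.5, -2.0], [0.5, 2.0], [0.5, 4.0]])
SV_neg = np.array([[-0.5, 1.0]])

X = np.vstack([X_pos, SV_pos, X_neg, SV_neg])
y = np.concatenate([np.ones(len(X_pos) + len(SV_pos)),
                    -np.ones(len(X_neg) + len(SV_neg))])

# Margin-concentrated test set: points just on either side of x1 = 0.
Xm = rng.normal(0.0, 1.0, size=(400, d))
Xm[:, 0] = rng.choice([0.3, -0.3], size=400)
ym = np.sign(Xm[:, 0])

w, b, lr = np.zeros(d), 0.0, 0.1

def accuracy(Xt, yt):
    return np.mean(np.sign(Xt @ w + b) == yt)

for t in range(1, 200001):
    p = 1.0 / (1.0 + np.exp(y * (X @ w + b)))   # sigmoid(-y * f(x))
    w -= lr * (-(p * y) @ X / len(X))
    b -= lr * (-(p * y).mean())
    if t % 20000 == 0:
        # Track bias dynamics alongside accuracies, per the paper's emphasis.
        print(t, "train:", accuracy(X, y), "margin:", accuracy(Xm, ym),
              "b:", round(b, 3))
```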
Related papers
- InfoNCE Induces Gaussian Distribution [7.8922077372145685]
InfoNCE and its variants are widely used losses in contrastive training. We show that the InfoNCE objective induces Gaussian structure in the representations that emerge from contrastive training. The resulting Gaussian model enables a principled analytical treatment of learned representations and is expected to support a wide range of applications in contrastive learning.
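For reference, here is a minimal sketch of the InfoNCE objective with in-batch negatives; the cosine similarity and the temperature `tau` are common conventions, not details taken from this paper.
```python
# Minimal InfoNCE sketch: each row of z is paired with the same row of z_pos,
# and all other rows of z_pos serve as in-batch negatives.
import numpy as np

def info_nce(z, z_pos, tau=0.1):
    """z, z_pos: (n, d) arrays of paired representations."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    z_pos = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)
    logits = z @ z_pos.T / tau                   # (i, j) = sim(z_i, z_j+) / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # matching pair is the "label"

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
z_pos = z + 0.1 * rng.normal(size=(8, 16))       # a perturbed second "view"
print(info_nce(z, z_pos))
```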
arXiv Detail & Related papers (2026-02-27T13:35:58Z)
- Generalization Below the Edge of Stability: The Role of Data Geometry [60.147710896851045]
We show how data geometry controls generalization in ReLU networks trained below the edge of stability. For data distributions supported on a mixture of low-dimensional balls, we derive generalization bounds that provably adapt to the intrinsic dimension. Our results consolidate disparate empirical findings that have appeared in the literature.
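As a rough illustration of the data model named above, here is a hypothetical sampler for a mixture of low-dimensional balls embedded in a higher-dimensional ambient space; the construction via random orthonormal frames is an assumption for illustration, not the paper's.
```python
# Hypothetical sampler: each mixture component is a k-dimensional unit ball
# mapped into R^D by a random orthonormal frame and shifted by a random center.
import numpy as np

def sample_ball_mixture(n, D=50, k=3, n_components=4, seed=0):
    rng = np.random.default_rng(seed)
    comps = rng.integers(n_components, size=n)
    X = np.empty((n, D))
    for c in range(n_components):
        frame, _ = np.linalg.qr(rng.normal(size=(D, k)))  # orthonormal frame
        center = rng.normal(size=D)
        idx = np.where(comps == c)[0]
        u = rng.normal(size=(len(idx), k))
        u /= np.linalg.norm(u, axis=1, keepdims=True)
        r = rng.random(len(idx)) ** (1.0 / k)             # uniform in the ball
        X[idx] = center + (u * r[:, None]) @ frame.T
    return X, comps

X, comps = sample_ball_mixture(1000)
print(X.shape)  # (1000, 50): ambient dimension D = 50, intrinsic k = 3
```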
arXiv Detail & Related papers (2025-10-20T21:40:36Z)
- In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
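A hedged sketch of the in-context linear regression task this line of work studies: each prompt is a sequence of (x, w.x) pairs followed by a query whose label the model must predict. The token layout below is one common convention, not necessarily this paper's.
```python
# Hypothetical prompt construction for in-context linear regression.
import numpy as np

def make_prompt(d=5, n_context=10, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=d)                    # fresh task vector per prompt
    X = rng.normal(size=(n_context + 1, d))   # context points plus one query
    y = X @ w
    tokens = np.concatenate([X, y[:, None]], axis=1)  # (x, y) per token
    tokens[-1, -1] = 0.0                      # query's label slot is masked
    return tokens, y[-1]                      # input sequence, target scalar

tokens, target = make_prompt()
print(tokens.shape, target)                   # (11, 6) sequence, scalar target
```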
arXiv Detail & Related papers (2025-03-17T02:00:49Z)
- Graph Out-of-Distribution Generalization with Controllable Data Augmentation [51.17476258673232]
Graph Neural Networks (GNNs) have demonstrated extraordinary performance in classifying graph properties.
Due to selection bias between training and testing data, distribution deviation is widespread.
We propose OOD calibration to measure the distribution deviation of virtual samples.
arXiv Detail & Related papers (2023-08-16T13:10:27Z)
- Learning Linear Causal Representations from Interventions under General Nonlinear Mixing [52.66151568785088]
We prove strong identifiability results given unknown single-node interventions without access to the intervention targets.
This is the first instance of causal identifiability from non-paired interventions for deep neural network embeddings.
arXiv Detail & Related papers (2023-06-04T02:32:12Z)
- Gradient flow in the Gaussian covariate model: exact solution of learning curves and multiple descent structures [14.578025146641806]
We provide a full and unified analysis of the whole time-evolution of the generalization curve.
We show that our theoretical predictions adequately match the learning curves obtained by gradient descent over realistic datasets.
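The key computational point, that gradient flow on the squared loss admits a closed-form time evolution, is easy to check numerically. The sketch below (an illustration under simplified assumptions, not the paper's analysis) traces training and parameter errors of ridgeless linear regression along the exact gradient-flow path obtained from the SVD of the data matrix.
```python
# Exact gradient-flow path for linear regression, traced via the SVD of X.
import numpy as np

rng = np.random.default_rng(0)
n, d, noise = 80, 120, 0.5
X = rng.normal(size=(n, d)) / np.sqrt(d)
w_star = rng.normal(size=d)
y = X @ w_star + noise * rng.normal(size=n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
c = U.T @ y                                   # y in the left singular basis

for t in [1e0, 1e1, 1e2, 1e3, 1e4]:
    # Gradient flow on (1/2)||Xw - y||^2 from w(0) = 0 has the closed form
    # w(t) = V diag((1 - exp(-s_i^2 t)) / s_i) U^T y.
    w_t = Vt.T @ ((1.0 - np.exp(-s**2 * t)) / s * c)
    train_err = np.mean((X @ w_t - y) ** 2)
    param_err = np.sum((w_t - w_star) ** 2)   # tracks the generalization error
    print(f"t={t:>7.0f}  train={train_err:.4f}  param={param_err:.4f}")
```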
arXiv Detail & Related papers (2022-12-13T17:39:18Z)
- Spectral Evolution and Invariance in Linear-width Neural Networks [8.419660614226816]
We investigate the spectral properties of linear-width feed-forward neural networks.
We show that the spectra of the weight matrices in this high-dimensional regime are invariant when trained by gradient descent for small constant learning rates.
We also show that after adaptive gradient training, where a lower test error and feature learning emerge, both the weight matrices and the kernels exhibit heavy-tailed behavior.
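A rough numerical probe of the invariance claim, under illustrative assumptions rather than the paper's exact setup: train a linear-width two-layer ReLU network by small-step gradient descent and compare the singular-value spectrum of the first-layer weights before and after training; the bulk should barely move.
```python
# Probe: bulk of the first-layer singular values under small-lr gradient descent.
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lr = 256, 256, 256, 0.01              # "linear width": m comparable to d
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = rng.normal(size=n)

W = rng.normal(size=(m, d)) / np.sqrt(d)
a = rng.normal(size=m) / np.sqrt(m)
s0 = np.linalg.svd(W, compute_uv=False)        # spectrum at initialization

for _ in range(500):
    H = np.maximum(X @ W.T, 0.0)               # (n, m) ReLU features
    r = H @ a - y                               # residuals
    a -= lr * (H.T @ r) / n
    G = (np.outer(r, a) * (H > 0)).T @ X / n    # dL/dW for the hidden layer
    W -= lr * G

s1 = np.linalg.svd(W, compute_uv=False)
print("top sv before/after:   ", s0[0], s1[0])
print("median sv before/after:", np.median(s0), np.median(s1))
```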
arXiv Detail & Related papers (2022-11-11T23:00:30Z)
- Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data [39.53312099194621]
We consider the generalization error of two-layer neural networks trained by gradient descent. We show that neural networks exhibit benign overfitting: they can be driven to zero training error, perfectly fitting any noisy training labels, and simultaneously achieve minimax-optimal test error. In contrast to previous work on benign overfitting, which requires linear or kernel-based predictors, our analysis holds in a setting where both the model and the learning dynamics are fundamentally nonlinear.
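A toy version of this phenomenon can be reproduced in a few lines. The sketch below (illustrative constants, a fixed second layer, and logistic loss, none of which are taken from the paper) trains an overparameterized two-layer ReLU classifier on linearly separable data with a few flipped labels; with enough steps it can fit the noise exactly while still classifying clean test points well.
```python
# Toy benign-overfitting probe: fit noisy labels, test on clean data.
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lr = 100, 20, 2000, 0.5
w_star = rng.normal(size=d); w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(n, d)); y = np.sign(X @ w_star)
flip = rng.choice(n, size=5, replace=False); y[flip] *= -1   # 5% label noise

Xt = rng.normal(size=(1000, d)); yt = np.sign(Xt @ w_star)   # clean test set

W = rng.normal(size=(m, d)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)             # fixed outer layer

def f(Z):
    return np.maximum(Z @ W.T, 0.0) @ a

for step in range(2000):
    H = np.maximum(X @ W.T, 0.0)
    g = -y / (1.0 + np.exp(y * (H @ a)))                     # logistic loss grad
    W -= lr * ((g[:, None] * a[None, :] * (H > 0)).T @ X) / n

print("train acc:", np.mean(np.sign(f(X)) == y))             # fits the noise
print("test acc :", np.mean(np.sign(f(Xt)) == yt))           # stays accurate
```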
arXiv Detail & Related papers (2022-02-11T23:04:00Z)
- Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
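The double descent shape itself is straightforward to reproduce numerically. Below is a quick check (illustrative, not this paper's multi-scale analysis) using min-norm least squares on random ReLU features, where test error typically peaks near the interpolation threshold m = n.
```python
# Double descent in test error as the number of random features m sweeps past n.
import numpy as np

rng = np.random.default_rng(0)
n, d, noise = 100, 10, 0.2
w_star = rng.normal(size=d)

def data(n_):
    X = rng.normal(size=(n_, d))
    return X, X @ w_star + noise * rng.normal(size=n_)

X, y = data(n)
Xt, yt = data(2000)

for m in [10, 50, 90, 100, 110, 200, 1000]:
    F = rng.normal(size=(d, m)) / np.sqrt(d)        # random first-layer weights
    H, Ht = np.maximum(X @ F, 0), np.maximum(Xt @ F, 0)
    beta = np.linalg.lstsq(H, y, rcond=None)[0]     # min-norm least squares
    print(f"m={m:>5d}  test mse={np.mean((Ht @ beta - yt) ** 2):.3f}")
```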
arXiv Detail & Related papers (2021-12-06T18:17:08Z)
- Model, sample, and epoch-wise descents: exact solution of gradient flow in the random feature model [16.067228939231047]
We analyze the whole temporal behavior of the generalization and training errors under gradient flow.
We show that in the limit of large system size the full time-evolution path of both errors can be calculated analytically.
Our techniques are based on Cauchy complex integral representations of the errors together with recent random matrix methods based on linear pencils.
arXiv Detail & Related papers (2021-10-22T14:25:54Z)
- The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.