Convergence and Implicit Bias of Gradient Descent on Continual Linear Classification
- URL: http://arxiv.org/abs/2504.12712v2
- Date: Sat, 26 Apr 2025 07:02:38 GMT
- Title: Convergence and Implicit Bias of Gradient Descent on Continual Linear Classification
- Authors: Hyunji Jung, Hanseul Cho, Chulhee Yun
- Abstract summary: We study continual learning on multiple linear classification tasks by sequentially running gradient descent (GD). We show the directional convergence of the trained linear classifier to the joint (offline) max-margin solution when tasks are jointly separable. We also analyze the case where the tasks are no longer jointly separable and show that the model trained in a cyclic order converges to the unique minimum of the joint loss function.
- Score: 12.699007098398805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study continual learning on multiple linear classification tasks by sequentially running gradient descent (GD) for a fixed budget of iterations per task. When all tasks are jointly linearly separable and are presented in a cyclic/random order, we show the directional convergence of the trained linear classifier to the joint (offline) max-margin solution. This is surprising because GD training on a single task is implicitly biased towards the individual max-margin solution for the task, and the direction of the joint max-margin solution can be largely different from these individual solutions. Additionally, when tasks are given in a cyclic order, we present a non-asymptotic analysis on cycle-averaged forgetting, revealing that (1) alignment between tasks is indeed closely tied to catastrophic forgetting and backward knowledge transfer and (2) the amount of forgetting vanishes to zero as the cycle repeats. Lastly, we analyze the case where the tasks are no longer jointly separable and show that the model trained in a cyclic order converges to the unique minimum of the joint loss function.
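The setup in the abstract can be sketched numerically: run GD on each task's logistic loss for a fixed iteration budget, cycling through the tasks. Below is a minimal sketch with hypothetical toy data (not the paper's experiments); with jointly separable tasks, the cyclically trained iterate ends up separating the union of all tasks, consistent with convergence toward the joint max-margin direction:

```python
import numpy as np

# Two jointly separable binary tasks in R^2 (hypothetical toy data).
# Labels are +1/-1; the union of both tasks is linearly separable.
tasks = [
    (np.array([[2.0, 1.0], [1.5, 2.0]]), np.array([1.0, 1.0])),
    (np.array([[-1.0, -2.0], [-2.0, -0.5]]), np.array([-1.0, -1.0])),
]

def grad_logistic(w, X, y):
    """Gradient of the average logistic loss (1/n) sum log(1 + exp(-y * x.w))."""
    s = -y / (1.0 + np.exp(y * (X @ w)))   # d(loss)/d(x.w) per sample
    return (X * s[:, None]).mean(axis=0)

w = np.zeros(2)
eta, iters_per_task, cycles = 0.5, 200, 50
for _ in range(cycles):                    # cyclic task ordering
    for X, y in tasks:                     # fixed GD budget per task
        for _ in range(iters_per_task):
            w -= eta * grad_logistic(w, X, y)

# Direction of the trained classifier; its norm diverges while the
# direction stabilizes on separable data.
print(w / np.linalg.norm(w))
```

After enough cycles, `w` correctly classifies every point of every task, even though each individual task would pull GD toward its own max-margin direction.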
Related papers
- Better Rates for Random Task Orderings in Continual Linear Models [50.11453013647086]
We analyze the forgetting, i.e., loss on previously seen tasks, after $k$ iterations. We develop novel last-iterate bounds in the realizable least squares setup, and apply them to derive new results for continual learning. We prove for the first time that randomization alone, with no task repetition, can prevent catastrophic forgetting in sufficiently long task sequences.
arXiv Detail & Related papers (2025-04-06T18:39:45Z) - On Sequential Maximum a Posteriori Inference for Continual Learning [0.0]
We formulate sequential maximum a posteriori inference as a recursion of loss functions and reduce the problem of continual learning to approximating the previous loss function. We propose two coreset-free methods: autodiff quadratic consolidation, which uses an accurate and full quadratic approximation, and neural consolidation, which uses a neural network approximation. We find that neural consolidation performs well in the classical task sequences, where the input dimension is small, while autodiff quadratic consolidation performs consistently well in image task sequences with a fixed pre-trained feature extractor.
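The recursion becomes concrete in the special case of quadratic (least-squares) task losses, where a quadratic approximation of the previous objective is exact. The following hypothetical numpy sketch (not the paper's autodiff implementation) accumulates curvature and linear terms task by task, and the sequential solution recovers the joint minimizer:

```python
import numpy as np

# Hypothetical quadratic task losses L_t(w) = 0.5 * ||X_t w - y_t||^2.
# Sequential MAP: at each step, minimize the new task loss plus a
# quadratic approximation of the accumulated previous objective.
# For quadratic losses the approximation is exact, so the recursion
# reproduces the joint (offline) least-squares solution.
tasks = [
    (np.array([[1.0, 0.0], [0.0, 2.0]]), np.array([1.0, 4.0])),
    (np.array([[1.0, 1.0]]), np.array([3.0])),
]

H = 1e-8 * np.eye(2)          # prior precision (tiny ridge for stability)
h = np.zeros(2)               # accumulated linear term
for X, y in tasks:
    H = H + X.T @ X           # consolidate curvature from the new task
    h = h + X.T @ y
    w_hat = np.linalg.solve(H, h)

print(w_hat)   # joint least-squares solution, here (1, 2)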
arXiv Detail & Related papers (2024-05-26T09:20:47Z) - Continual Learning in Linear Classification on Separable Data [34.78569443156924]
We show that learning with weak regularization reduces to solving a sequential max-margin problem.
We then develop upper bounds on the forgetting and other quantities of interest under various settings.
We discuss several practical implications to popular training practices like regularization scheduling and weighting.
arXiv Detail & Related papers (2023-06-06T09:34:11Z) - Learning Linear Causal Representations from Interventions under General Nonlinear Mixing [52.66151568785088]
We prove strong identifiability results given unknown single-node interventions without access to the intervention targets.
This is the first instance of causal identifiability from non-paired interventions for deep neural network embeddings.
arXiv Detail & Related papers (2023-06-04T02:32:12Z) - Score-based Causal Representation Learning with Interventions [54.735484409244386]
This paper studies the causal representation learning problem when latent causal variables are observed indirectly.
The objectives are: (i) recovering the unknown linear transformation (up to scaling) and (ii) determining the directed acyclic graph (DAG) underlying the latent variables.
arXiv Detail & Related papers (2023-01-19T18:39:48Z) - Intersection of Parallels as an Early Stopping Criterion [64.8387564654474]
We propose a method to spot an early stopping point in the training iterations without the need for a validation set.
For a wide range of learning rates, our method, called Cosine-Distance Criterion (CDC), leads to better generalization on average than all the methods that we compare against.
arXiv Detail & Related papers (2022-08-19T19:42:41Z) - How catastrophic can catastrophic forgetting be in linear regression? [30.702863017223457]
We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks.
We establish connections between continual learning in the linear setting and two other research areas: alternating projections and the Kaczmarz method.
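The Kaczmarz connection can be illustrated directly: if each task is a single linear equation trained to convergence, the continual update is an orthogonal projection onto that task's solution hyperplane, and cycling over tasks is exactly the Kaczmarz iteration. A toy sketch with hypothetical data:

```python
import numpy as np

# Each "task" is one linear equation a_i . w = b_i. Fitting a task to
# convergence projects w onto that task's solution hyperplane, so cyclic
# continual learning coincides with the Kaczmarz method for A w = b.
A = np.array([[1.0, 2.0], [3.0, 1.0], [1.0, -1.0]])
b = np.array([5.0, 5.0, -1.0])          # consistent system, solved by (1, 2)

w = np.zeros(2)
for _ in range(200):                     # repeat the task cycle
    for a_i, b_i in zip(A, b):
        # orthogonal projection onto {w : a_i . w = b_i}
        w += (b_i - a_i @ w) / (a_i @ a_i) * a_i

print(w)   # converges to the joint solution (1, 2)
```

On a consistent system the cyclic projections converge linearly to the joint solution; the per-cycle residuals play the role of forgetting.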
arXiv Detail & Related papers (2022-05-19T14:28:40Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We analyze the Stochastic ExtraGradient (SEG) method with constant step size and present variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
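A deterministic sketch of the extragradient update with iterate averaging on a toy bilinear game (illustrative only; the paper analyzes the stochastic setting with scheduled restarting, which this sketch omits):

```python
import numpy as np

# Bilinear game min_x max_y  x^T A y with unique Nash equilibrium (0, 0).
A = np.array([[1.0, 0.5], [0.0, 1.0]])
x, y = np.ones(2), np.ones(2)
eta = 0.1
x_avg, y_avg = np.zeros(2), np.zeros(2)

T = 2000
for t in range(1, T + 1):
    # extrapolation step (look-ahead gradients)
    x_half = x - eta * (A @ y)
    y_half = y + eta * (A.T @ x)
    # update step uses gradients evaluated at the extrapolated point
    x = x - eta * (A @ y_half)
    y = y + eta * (A.T @ x_half)
    # running average of the iterates
    x_avg += (x - x_avg) / t
    y_avg += (y - y_avg) / t

print(np.linalg.norm(x_avg), np.linalg.norm(y_avg))
```

Plain gradient descent-ascent spirals outward on this game; the extrapolation step turns the spiral inward, and averaging further damps the residual rotation around the equilibrium.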
arXiv Detail & Related papers (2021-06-30T17:51:36Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central to preventing overfitting in practice.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - A Retrospective Approximation Approach for Smooth Stochastic Optimization [0.2867517731896504]
Stochastic Gradient (SG) is the de facto iterative technique to solve stochastic optimization (SO) problems with a smooth (non-convex) objective.
arXiv Detail & Related papers (2021-03-07T16:29:36Z) - Competitive Mirror Descent [67.31015611281225]
Constrained competitive optimization involves multiple agents trying to minimize conflicting objectives, subject to constraints.
We propose competitive mirror descent (CMD): a general method for solving such problems based on first order information.
As a special case we obtain a novel competitive multiplicative weights algorithm for problems on the positive cone.
arXiv Detail & Related papers (2020-06-17T22:11:35Z) - A Relaxed Inertial Forward-Backward-Forward Algorithm for Solving Monotone Inclusions with Application to GANs [0.0]
We introduce a relaxed inertial forward-backward-forward (RIFBF) splitting algorithm for approaching the set of zeros of the sum of a maximally monotone operator and a single-valued monotone and Lipschitz continuous operator.
We illustrate the proposed method by applications to a bilinear saddle point problem, in the context of which we also emphasize the interplay between the inertial and the relaxation parameters.
arXiv Detail & Related papers (2020-03-17T18:52:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.