Convergence and Implicit Bias of Gradient Descent on Continual Linear Classification
- URL: http://arxiv.org/abs/2504.12712v2
- Date: Sat, 26 Apr 2025 07:02:38 GMT
- Title: Convergence and Implicit Bias of Gradient Descent on Continual Linear Classification
- Authors: Hyunji Jung, Hanseul Cho, Chulhee Yun
- Abstract summary: We study continual learning on multiple linear classification tasks by sequentially running gradient descent (GD). We show the directional convergence of the trained linear classifier to the joint (offline) max-margin solution when tasks are jointly separable. We also analyze the case where the tasks are no longer jointly separable and show that the model trained in a cyclic order converges to the unique minimum of the joint loss function.
- Score: 12.699007098398805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study continual learning on multiple linear classification tasks by sequentially running gradient descent (GD) for a fixed budget of iterations per task. When all tasks are jointly linearly separable and are presented in a cyclic/random order, we show the directional convergence of the trained linear classifier to the joint (offline) max-margin solution. This is surprising because GD training on a single task is implicitly biased towards the individual max-margin solution for the task, and the direction of the joint max-margin solution can be largely different from these individual solutions. Additionally, when tasks are given in a cyclic order, we present a non-asymptotic analysis on cycle-averaged forgetting, revealing that (1) alignment between tasks is indeed closely tied to catastrophic forgetting and backward knowledge transfer and (2) the amount of forgetting vanishes to zero as the cycle repeats. Lastly, we analyze the case where the tasks are no longer jointly separable and show that the model trained in a cyclic order converges to the unique minimum of the joint loss function.
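The setup in the abstract can be sketched numerically: run GD on each task's logistic loss for a fixed iteration budget, cycling through the tasks. Below is a minimal sketch with hypothetical toy data (not the paper's experiments); with jointly separable tasks, the cyclically trained iterate ends up separating the union of all tasks, consistent with convergence toward the joint max-margin direction:

```python
import numpy as np

# Two jointly separable binary tasks in R^2 (hypothetical toy data).
# Labels are +1/-1; the union of both tasks is linearly separable.
tasks = [
    (np.array([[2.0, 1.0], [1.5, 2.0]]), np.array([1.0, 1.0])),
    (np.array([[-1.0, -2.0], [-2.0, -0.5]]), np.array([-1.0, -1.0])),
]

def grad_logistic(w, X, y):
    """Gradient of the average logistic loss (1/n) sum log(1 + exp(-y * x.w))."""
    s = -y / (1.0 + np.exp(y * (X @ w)))   # d(loss)/d(x.w) per sample
    return (X * s[:, None]).mean(axis=0)

w = np.zeros(2)
eta, iters_per_task, cycles = 0.5, 200, 50
for _ in range(cycles):                    # cyclic task ordering
    for X, y in tasks:                     # fixed GD budget per task
        for _ in range(iters_per_task):
            w -= eta * grad_logistic(w, X, y)

# Direction of the trained classifier; its norm diverges while the
# direction stabilizes on separable data.
print(w / np.linalg.norm(w))
```

After enough cycles, `w` correctly classifies every point of every task, even though each individual task would pull GD toward its own max-margin direction.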
Related papers
- Better Rates for Random Task Orderings in Continual Linear Models [50.11453013647086]
We analyze the forgetting, i.e., loss on previously seen tasks, after $k$ iterations. We develop novel last-iterate bounds in the realizable least squares setup, and apply them to derive new results for continual learning. We prove for the first time that randomization alone, with no task repetition, can prevent catastrophic forgetting in sufficiently long task sequences.
arXiv Detail & Related papers (2025-04-06T18:39:45Z) - On Sequential Maximum a Posteriori Inference for Continual Learning [0.0]
We formulate sequential maximum a posteriori inference as a recursion of loss functions and reduce the problem of continual learning to approximating the previous loss function. We propose two coreset-free methods: autodiff quadratic consolidation, which uses an accurate and full quadratic approximation, and neural consolidation, which uses a neural network approximation. We find that neural consolidation performs well in the classical task sequences, where the input dimension is small, while autodiff quadratic consolidation performs consistently well in image task sequences with a fixed pre-trained feature extractor.
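The recursion becomes concrete in the special case of quadratic (least-squares) task losses, where a quadratic approximation of the previous objective is exact. The following hypothetical numpy sketch (not the paper's autodiff implementation) accumulates curvature and linear terms task by task, and the sequential solution recovers the joint minimizer:

```python
import numpy as np

# Hypothetical quadratic task losses L_t(w) = 0.5 * ||X_t w - y_t||^2.
# Sequential MAP: at each step, minimize the new task loss plus a
# quadratic approximation of the accumulated previous objective.
# For quadratic losses the approximation is exact, so the recursion
# reproduces the joint (offline) least-squares solution.
tasks = [
    (np.array([[1.0, 0.0], [0.0, 2.0]]), np.array([1.0, 4.0])),
    (np.array([[1.0, 1.0]]), np.array([3.0])),
]

H = 1e-8 * np.eye(2)          # prior precision (tiny ridge for stability)
h = np.zeros(2)               # accumulated linear term
for X, y in tasks:
    H = H + X.T @ X           # consolidate curvature from the new task
    h = h + X.T @ y
    w_hat = np.linalg.solve(H, h)

print(w_hat)   # joint least-squares solution, here (1, 2)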
arXiv Detail & Related papers (2024-05-26T09:20:47Z) - Continual Learning in Linear Classification on Separable Data [34.78569443156924]
We show that learning with weak regularization reduces to solving a sequential max-margin problem.
We then develop upper bounds on the forgetting and other quantities of interest under various settings.
We discuss several practical implications to popular training practices like regularization scheduling and weighting.
arXiv Detail & Related papers (2023-06-06T09:34:11Z) - Learning Linear Causal Representations from Interventions under General Nonlinear Mixing [52.66151568785088]
We prove strong identifiability results given unknown single-node interventions without access to the intervention targets.
This is the first instance of causal identifiability from non-paired interventions for deep neural network embeddings.
arXiv Detail & Related papers (2023-06-04T02:32:12Z) - Score-based Causal Representation Learning with Interventions [54.735484409244386]
This paper studies the causal representation learning problem when latent causal variables are observed indirectly.
The objectives are: (i) recovering the unknown linear transformation (up to scaling) and (ii) determining the directed acyclic graph (DAG) underlying the latent variables.
arXiv Detail & Related papers (2023-01-19T18:39:48Z) - Intersection of Parallels as an Early Stopping Criterion [64.8387564654474]
We propose a method to spot an early stopping point in the training iterations without the need for a validation set.
For a wide range of learning rates, our method, called Cosine-Distance Criterion (CDC), leads to better generalization on average than all the methods that we compare against.
arXiv Detail & Related papers (2022-08-19T19:42:41Z) - How catastrophic can catastrophic forgetting be in linear regression? [30.702863017223457]
We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks.
We establish connections between continual learning in the linear setting and two other research areas: alternating projections and the Kaczmarz method.
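The Kaczmarz connection can be illustrated directly: if each task is a single linear equation trained to convergence, the continual update is an orthogonal projection onto that task's solution hyperplane, and cycling over tasks is exactly the Kaczmarz iteration. A toy sketch with hypothetical data:

```python
import numpy as np

# Each "task" is one linear equation a_i . w = b_i. Fitting a task to
# convergence projects w onto that task's solution hyperplane, so cyclic
# continual learning coincides with the Kaczmarz method for A w = b.
A = np.array([[1.0, 2.0], [3.0, 1.0], [1.0, -1.0]])
b = np.array([5.0, 5.0, -1.0])          # consistent system, solved by (1, 2)

w = np.zeros(2)
for _ in range(200):                     # repeat the task cycle
    for a_i, b_i in zip(A, b):
        # orthogonal projection onto {w : a_i . w = b_i}
        w += (b_i - a_i @ w) / (a_i @ a_i) * a_i

print(w)   # converges to the joint solution (1, 2)
```

On a consistent system the cyclic projections converge linearly to the joint solution; the per-cycle residuals play the role of forgetting.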
arXiv Detail & Related papers (2022-05-19T14:28:40Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We analyze the Stochastic ExtraGradient (SEG) method with constant step size and present variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
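A deterministic sketch of the extragradient update with iterate averaging on a toy bilinear game (illustrative only; the paper analyzes the stochastic setting with scheduled restarting, which this sketch omits):

```python
import numpy as np

# Bilinear game min_x max_y  x^T A y with unique Nash equilibrium (0, 0).
A = np.array([[1.0, 0.5], [0.0, 1.0]])
x, y = np.ones(2), np.ones(2)
eta = 0.1
x_avg, y_avg = np.zeros(2), np.zeros(2)

T = 2000
for t in range(1, T + 1):
    # extrapolation step (look-ahead gradients)
    x_half = x - eta * (A @ y)
    y_half = y + eta * (A.T @ x)
    # update step uses gradients evaluated at the extrapolated point
    x = x - eta * (A @ y_half)
    y = y + eta * (A.T @ x_half)
    # running average of the iterates
    x_avg += (x - x_avg) / t
    y_avg += (y - y_avg) / t

print(np.linalg.norm(x_avg), np.linalg.norm(y_avg))
```

Plain gradient descent-ascent spirals outward on this game; the extrapolation step turns the spiral inward, and averaging further damps the residual rotation around the equilibrium.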
arXiv Detail & Related papers (2021-06-30T17:51:36Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central to preventing overfitting in practice.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - A Retrospective Approximation Approach for Smooth Stochastic Optimization [0.2867517731896504]
Stochastic Gradient (SG) is the de facto iterative technique to solve stochastic optimization (SO) problems with a smooth (non-convex) objective.
arXiv Detail & Related papers (2021-03-07T16:29:36Z) - Competitive Mirror Descent [67.31015611281225]
Constrained competitive optimization involves multiple agents trying to minimize conflicting objectives, subject to constraints.
We propose competitive mirror descent (CMD): a general method for solving such problems based on first order information.
As a special case we obtain a novel competitive multiplicative weights algorithm for problems on the positive cone.
arXiv Detail & Related papers (2020-06-17T22:11:35Z) - A Relaxed Inertial Forward-Backward-Forward Algorithm for Solving Monotone Inclusions with Application to GANs [0.0]
We introduce a relaxed inertial forward-backward-forward (RIFBF) splitting algorithm for approaching the set of zeros of the sum of a maximally monotone operator and a single-valued monotone and Lipschitz continuous operator.
We illustrate the proposed method by applications to a bilinear saddle point problem, in the context of which we also emphasize the interplay between the inertial and the relaxation parameters.
arXiv Detail & Related papers (2020-03-17T18:52:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.