Are Greedy Task Orderings Better Than Random in Continual Linear Regression?
- URL: http://arxiv.org/abs/2510.19941v1
- Date: Wed, 22 Oct 2025 18:09:59 GMT
- Title: Are Greedy Task Orderings Better Than Random in Continual Linear Regression?
- Authors: Matan Tsipory, Ran Levinstein, Itay Evron, Mark Kong, Deanna Needell, Daniel Soudry
- Abstract summary: We analyze task orderings in continual learning for linear regression. We focus on orderings that greedily maximize dissimilarity between consecutive tasks. We show that greedy orderings converge faster than random ones in terms of the average loss across tasks.
- Score: 23.706463629642155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We analyze task orderings in continual learning for linear regression, assuming joint realizability of training data. We focus on orderings that greedily maximize dissimilarity between consecutive tasks, a concept briefly explored in prior work but still surrounded by open questions. Using tools from the Kaczmarz method literature, we formalize such orderings and develop geometric and algebraic intuitions around them. Empirically, we demonstrate that greedy orderings converge faster than random ones in terms of the average loss across tasks, both for linear regression with random data and for linear probing on CIFAR-100 classification tasks. Analytically, in a high-rank regression setting, we prove a loss bound for greedy orderings analogous to that of random ones. However, under general rank, we establish a repetition-dependent separation. Specifically, while prior work showed that for random orderings, with or without replacement, the average loss after $k$ iterations is bounded by $\mathcal{O}(1/\sqrt{k})$, we prove that single-pass greedy orderings may fail catastrophically, whereas those allowing repetition converge at rate $\mathcal{O}(1/\sqrt[3]{k})$. Overall, we reveal nuances within and between greedy and random orderings.
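To make the setting concrete, here is a minimal Python sketch of continual linear regression over jointly realizable tasks. Fitting each task to convergence from the previous solution reduces to a projection onto that task's solution set, as in the (block) Kaczmarz method; the residual-based greedy rule below (pick the task the current iterate violates most) is one plausible instantiation of "maximal dissimilarity," not necessarily the paper's exact criterion, and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Jointly realizable tasks: a single w_star satisfies A_i w_star = b_i for all i.
d, n_tasks, m = 20, 10, 5
w_star = rng.standard_normal(d)
tasks = []
for _ in range(n_tasks):
    A = rng.standard_normal((m, d))
    tasks.append((A, A @ w_star))

def project(w, A, b):
    """Least-norm update w + A^+ (b - A w): what fitting one task to convergence does."""
    return w + np.linalg.pinv(A) @ (b - A @ w)

def avg_loss(w):
    return np.mean([np.linalg.norm(A @ w - b) ** 2 for A, b in tasks])

w = np.zeros(d)
for k in range(50):
    # Greedy ordering with repetition: pick the task whose constraints are
    # most violated by the current iterate.
    t = int(np.argmax([np.linalg.norm(A @ w - b) for A, b in tasks]))
    w = project(w, *tasks[t])

print("average loss after greedy ordering:", avg_loss(w))
```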
Related papers
- Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning [61.07540493350384]
Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher's own predictions. We show that for any prediction risk, the optimally mixed student improves upon the ridge teacher for every regularization level. We propose a consistent one-shot tuning method to estimate the optimal mixing weight without grid search, sample splitting, or refitting.
arXiv Detail & Related papers (2026-02-19T17:21:15Z)
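As a rough illustration of the setup this entry describes (not the paper's analysis), the sketch below fits a ridge teacher and then a student on targets mixed between the labels and the teacher's predictions. The mixing weight `alpha` is set arbitrarily here, whereas the paper tunes the optimal one in a single shot.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 200, 30, 1.0
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)

def ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_teacher = ridge(X, y, lam)

alpha = 0.7  # illustrative mixing weight; the paper estimates the optimal one
y_mixed = alpha * y + (1 - alpha) * (X @ w_teacher)
w_student = ridge(X, y_mixed, lam)
```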
- CAIRO: Decoupling Order from Scale in Regression [13.755937210012883]
We propose a framework that decouples regression into two distinct stages. In the first stage, we learn a scoring function by minimizing a scale-invariant ranking loss. In the second, we recover the target scale via isotonic regression.
arXiv Detail & Related papers (2026-02-16T03:50:05Z)
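A hypothetical sketch of the two-stage recipe the summary describes: stage one fits a linear scorer with a pairwise logistic ranking loss (scale-invariant, since only score differences matter), and stage two maps scores back to the target scale with scikit-learn's isotonic regression. The scorer, loss, and data are stand-ins, not the paper's architecture.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(2)
n, d = 300, 10
X = rng.standard_normal((n, d))
y = np.exp(X @ rng.standard_normal(d) * 0.3)  # positive, skewed targets

# Stage 1: learn a scorer with a pairwise (scale-invariant) ranking loss.
w = np.zeros(d)
lr = 0.1
for _ in range(2000):
    i, j = rng.integers(n, size=2)
    if y[i] == y[j]:
        continue
    sign = 1.0 if y[i] > y[j] else -1.0
    margin = sign * (X[i] - X[j]) @ w
    # gradient of the logistic ranking loss log(1 + exp(-margin))
    g = -sign * (X[i] - X[j]) / (1 + np.exp(margin))
    w -= lr * g

# Stage 2: recover the target scale with isotonic regression on the scores.
scores = X @ w
iso = IsotonicRegression(out_of_bounds="clip")
y_hat = iso.fit_transform(scores, y)
```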
- Convergence and Implicit Bias of Gradient Descent on Continual Linear Classification [12.699007098398805]
We study continual learning on multiple linear classification tasks by sequentially running gradient descent (GD). We show the directional convergence of the trained linear classifier to the joint (offline) max-margin solution when tasks are jointly separable. We also analyze the case where the tasks are no longer jointly separable and show that the model trained in a cyclic order converges to the unique minimum of the joint loss function.
arXiv Detail & Related papers (2025-04-17T07:35:48Z)
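The training protocol analyzed there, sequential GD on one task's logistic loss at a time, can be sketched as below. The theorem concerns the direction of the resulting iterate; the task construction and constants here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_tasks, m = 10, 4, 20

# Jointly separable tasks: a single unit vector u classifies all of them.
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
tasks = []
for _ in range(n_tasks):
    X = rng.standard_normal((m, d))
    labels = np.sign(X @ u)
    tasks.append((X, labels))

w = np.zeros(d)
lr = 0.1
for _ in range(100):            # cyclic passes over the tasks
    for X, labels in tasks:
        for _ in range(50):     # GD steps on the current task's logistic loss
            margins = labels * (X @ w)
            grad = -(labels / (1 + np.exp(margins))) @ X / m
            w -= lr * grad

w_dir = w / np.linalg.norm(w)   # the direction is what converges (to max-margin)
```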
- From Continual Learning to SGD and Back: Better Rates for Continual Linear Models [50.11453013647086]
We analyze forgetting, i.e., the loss on previously seen tasks, after $k$ iterations. We develop novel last-iterate upper bounds in the realizable least squares setup. We prove for the first time that randomization alone, with no task repetition, can prevent catastrophic forgetting in sufficiently long task sequences.
arXiv Detail & Related papers (2025-04-06T18:39:45Z)
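A small sketch of the quantity being bounded, assuming the realizable least-squares setup: run a random ordering with replacement for $k$ steps (each step fits one task exactly, i.e., a projection) and measure the last iterate's loss on the tasks seen so far. This illustrates the metric, not the proofs.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_tasks, m, k = 30, 15, 3, 60
w_star = rng.standard_normal(d)
tasks = []
for _ in range(n_tasks):
    A = rng.standard_normal((m, d))
    tasks.append((A, A @ w_star))

def project(w, A, b):
    """Fit one task exactly, starting from the current iterate."""
    return w + np.linalg.pinv(A) @ (b - A @ w)

w = np.zeros(d)
seen = []
for t in rng.integers(n_tasks, size=k):   # random ordering with replacement
    w = project(w, *tasks[t])
    seen.append(t)

# Forgetting: average loss of the last iterate on tasks seen so far.
forgetting = np.mean([np.linalg.norm(tasks[t][0] @ w - tasks[t][1]) ** 2
                      for t in set(seen)])
print(forgetting)
```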
- Robust Capped lp-Norm Support Vector Ordinal Regression [85.84718111830752]
Ordinal regression is a specialized supervised problem where the labels show an inherent order.
Support Vector Ordinal Regression is a prominent ordinal regression model widely used in ordinal regression tasks.
We introduce a new model, Capped $\ell_p$-Norm Support Vector Ordinal Regression (CSVOR), that is robust to outliers.
arXiv Detail & Related papers (2024-04-25T13:56:05Z)
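Only hinting at the robustness mechanism in the title: a capped $\ell_p$ penalty clips each residual's contribution at a threshold, so gross outliers cannot dominate the objective. This is the generic capping idea, not the full CSVOR model.

```python
import numpy as np

def capped_lp(residuals, p=1.0, eps=2.0):
    """Capped l_p penalty: each residual contributes at most eps to the loss."""
    return np.minimum(np.abs(residuals) ** p, eps)

r = np.array([0.1, -0.5, 1.2, 8.0, -25.0])   # the last two look like outliers
print(capped_lp(r))                          # -> [0.1, 0.5, 1.2, 2.0, 2.0]
```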
- SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking [60.109453252858806]
A maximum-likelihood (MLE) objective does not match the downstream use case of autoregressively generating high-quality sequences.
We formulate sequence generation as an imitation learning (IL) problem.
This allows us to minimize a variety of divergences between the distribution of sequences generated by an autoregressive model and sequences from a dataset.
Our resulting method, SequenceMatch, can be implemented without adversarial training or architectural changes.
arXiv Detail & Related papers (2023-06-08T17:59:58Z)
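Very loosely, the imitation-learning view amounts to comparing the distribution of model-generated sequences against the data distribution and minimizing some divergence between them. The toy sketch below computes one such divergence (a token-level KL between empirical distributions) purely for intuition; it is not SequenceMatch's objective.

```python
import numpy as np

rng = np.random.default_rng(9)
V = 5                                   # toy vocabulary size

def empirical_dist(seqs, V):
    """Smoothed empirical token distribution over a list of sequences."""
    counts = np.bincount(np.concatenate(seqs), minlength=V) + 1e-9
    return counts / counts.sum()

data_seqs = [rng.integers(V, size=10) for _ in range(100)]
model_seqs = [rng.integers(V, size=10) for _ in range(100)]  # stand-in samples

p = empirical_dist(data_seqs, V)
q = empirical_dist(model_seqs, V)
kl = float(np.sum(p * np.log(p / q)))   # one divergence an IL objective can target
print(kl)
```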
- How catastrophic can catastrophic forgetting be in linear regression? [30.702863017223457]
We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks.
We establish connections between continual learning in the linear setting and two other research areas: alternating projections and the Kaczmarz method.
arXiv Detail & Related papers (2022-05-19T14:28:40Z)
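The Kaczmarz connection is easy to see in code: fitting a one-sample linear regression task to convergence from the current iterate is exactly an orthogonal projection onto that sample's solution hyperplane, which is the classical Kaczmarz step. A minimal sketch on a consistent system:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 8
w_star = rng.standard_normal(d)
rows = rng.standard_normal((40, d))
bs = rows @ w_star                      # consistent system; one row per "task"

w = np.zeros(d)
for _ in range(500):
    i = rng.integers(len(rows))
    a, b = rows[i], bs[i]
    # Kaczmarz step = orthogonal projection onto {w : a.w = b},
    # i.e., fitting a one-sample regression task to convergence.
    w += (b - a @ w) / (a @ a) * a

print(np.linalg.norm(w - w_star))       # should be near zero
```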
- Multi-task Representation Learning with Stochastic Linear Bandits [29.8208189270894]
We study the problem of transfer learning in the setting of linear bandit tasks.
We consider that a low dimensional linear representation is shared across the tasks, and study the benefit of learning this representation in the multi-task learning setting.
arXiv Detail & Related papers (2022-02-21T09:26:34Z)
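One common way to exploit a shared low-dimensional linear representation, sketched here outside the bandit setting for simplicity: estimate each task's parameters separately, stack them, and recover the shared subspace from the top singular vectors. This is a generic illustration, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(6)
d, r, n_tasks, m = 20, 3, 12, 50
B = np.linalg.qr(rng.standard_normal((d, r)))[0]   # shared representation

W_hat = []
for _ in range(n_tasks):
    X = rng.standard_normal((m, d))
    y = X @ (B @ rng.standard_normal(r)) + 0.1 * rng.standard_normal(m)
    W_hat.append(np.linalg.lstsq(X, y, rcond=None)[0])

# Estimate the shared subspace from the stacked per-task estimates.
U, _, _ = np.linalg.svd(np.column_stack(W_hat))
B_hat = U[:, :r]
# Subspace recovery check: projection error of B onto span(B_hat).
print(np.linalg.norm(B - B_hat @ (B_hat.T @ B)))
```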
- Discovering Non-monotonic Autoregressive Orderings with Variational Inference [67.27561153666211]
We develop an unsupervised parallelizable learner that discovers high-quality generation orders purely from training data.
We implement the encoder as a Transformer with non-causal attention that outputs permutations in one forward pass.
Empirical results in language modeling tasks demonstrate that our method is context-aware and discovers orderings that are competitive with or even better than fixed orders.
arXiv Detail & Related papers (2021-10-27T16:08:09Z)
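The one-forward-pass idea can be caricatured in a few lines: non-causal self-attention pools information from the whole sequence, each position gets a scalar score, and sorting the scores induces a permutation. The weights below are random placeholders; in the paper they are trained with variational inference.

```python
import numpy as np

rng = np.random.default_rng(7)
T, h = 6, 16                      # sequence length, hidden size
H = rng.standard_normal((T, h))   # token encodings

# Non-causal self-attention: every position attends to every other.
Q = H @ rng.standard_normal((h, h))
K = H @ rng.standard_normal((h, h))
att = np.exp(Q @ K.T / np.sqrt(h))
att /= att.sum(axis=1, keepdims=True)
ctx = att @ H

# One score per position; sorting the scores yields a generation order.
scores = ctx @ rng.standard_normal(h)
order = np.argsort(-scores)       # permutation produced in one forward pass
print(order)
```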
- MLE-guided parameter search for task loss minimization in neural sequence modeling [83.83249536279239]
Neural autoregressive sequence models are used to generate sequences in a variety of natural language processing (NLP) tasks.
We propose maximum likelihood guided parameter search (MGS), which samples from a distribution over update directions that is a mixture of random search around the current parameters and around the maximum likelihood gradient.
Our experiments show that MGS is capable of optimizing sequence-level losses, with substantial reductions in repetition and non-termination in sequence completion, and similar improvements to those of minimum risk training in machine translation.
arXiv Detail & Related papers (2020-06-04T22:21:22Z)
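Reading the summary literally, one can sketch an MGS-style step as follows: candidate updates are drawn from a mixture of Gaussian perturbations centered at the current parameters and at an MLE gradient step, and the task loss selects among them. The actual method weights samples by the task loss rather than taking a greedy argmin, so treat this as a simplified variant with hypothetical names (`mgs_step`, `task_loss`).

```python
import numpy as np

rng = np.random.default_rng(8)

def mgs_step(theta, grad_mle, task_loss, n_cand=8, sigma=0.01, lr=0.1, p_mle=0.5):
    """One MGS-style update (sketch): sample candidates from a mixture of
    random search around theta and around the MLE gradient step, then keep
    the candidate with the lowest task loss."""
    best, best_loss = theta, task_loss(theta)
    for _ in range(n_cand):
        center = -lr * grad_mle if rng.random() < p_mle else np.zeros_like(theta)
        cand = theta + center + sigma * rng.standard_normal(theta.shape)
        loss = task_loss(cand)
        if loss < best_loss:
            best, best_loss = cand, loss
    return best

# Toy usage: quadratic "task loss" with its gradient as an MLE stand-in.
theta0 = np.zeros(4)
loss = lambda th: float(np.sum((th - 1.0) ** 2))
grad = 2 * (theta0 - 1.0)
theta1 = mgs_step(theta0, grad, loss)
```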