Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training
- URL: http://arxiv.org/abs/2503.02844v2
- Date: Thu, 06 Mar 2025 00:17:08 GMT
- Title: Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training
- Authors: Vaibhav Singh, Paul Janson, Paria Mehrbod, Adam Ibrahim, Irina Rish, Eugene Belilovsky, Benjamin Thérien
- Abstract summary: Unlabeled data presents both opportunities and challenges for training artificial intelligence systems. While self-supervised learning has emerged as a powerful paradigm for extracting meaningful representations from vast amounts of unlabeled data, existing methods still struggle to adapt to the non-stationary, non-IID nature of real-world data streams without forgetting previously learned knowledge. In this work, we systematically compare the widely used cosine schedule with the recently proposed infinite learning rate schedule and empirically find the latter to be a more effective alternative.
- Score: 17.371995357694736
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ever-growing availability of unlabeled data presents both opportunities and challenges for training artificial intelligence systems. While self-supervised learning (SSL) has emerged as a powerful paradigm for extracting meaningful representations from vast amounts of unlabeled data, existing methods still struggle to adapt to the non-stationary, non-IID nature of real-world data streams without forgetting previously learned knowledge. Recent works have adopted a repeated cosine annealing schedule for large-scale continual pre-training; however, these schedules (1) inherently cause forgetting during the re-warming phase and (2) have not been systematically compared to existing continual SSL methods. In this work, we systematically compare the widely used cosine schedule with the recently proposed infinite learning rate schedule and empirically find the latter to be a more effective alternative. Our extensive empirical evaluation across diverse image and language datasets demonstrates that the infinite learning rate schedule consistently enhances continual pre-training performance compared to a repeated cosine decay without being restricted to a fixed iteration budget. For instance, in a small-scale MAE pre-training setup, it outperforms several strong baselines from the literature. We then scale up our experiments to larger MAE pre-training and autoregressive language model pre-training. Our results show that the infinite learning rate schedule remains effective at scale, surpassing repeated cosine decay for both MAE pre-training and zero-shot LM benchmarks.
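The abstract does not spell out the exact schedule shape, but the infinite learning rate schedule discussed in the continual pre-training literature is typically built from a one-time warmup, a cooldown to a constant plateau, and a long constant phase that can be extended indefinitely as new data arrives. The sketch below contrasts that idea with a repeated cosine decay; the phase lengths, plateau value, and decay shapes are illustrative assumptions, not the paper's exact settings.

```python
import math

def repeated_cosine_lr(step, max_lr=3e-4, min_lr=3e-5, warmup=1000, cycle_len=100_000):
    """Repeated cosine decay: every new dataset restarts warmup + cosine annealing."""
    s = step % cycle_len
    if s < warmup:                       # re-warming phase (a known source of forgetting)
        return max_lr * s / warmup
    t = (s - warmup) / (cycle_len - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

def infinite_lr(step, max_lr=3e-4, const_lr=1e-4, warmup=1000, cooldown=10_000):
    """Illustrative 'infinite' schedule: warm up once, cool down to a constant plateau,
    then stay constant so training can continue on new data without re-warming.
    (A short annealing phase can be run before evaluation checkpoints.)"""
    if step < warmup:                    # initial warmup, done only once
        return max_lr * step / warmup
    if step < warmup + cooldown:         # cosine cooldown from max_lr to const_lr
        t = (step - warmup) / cooldown
        return const_lr + 0.5 * (max_lr - const_lr) * (1 + math.cos(math.pi * t))
    return const_lr                      # constant phase: no fixed iteration budget needed
```

Because the constant phase never decays to zero, new data can be appended without another re-warming cycle and without committing to a fixed iteration budget, which is the failure mode the abstract attributes to repeated cosine decay.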
Related papers
- Bayesian continual learning and forgetting in neural networks [0.8795040582681392]
We introduce Metaplasticity from Synaptic Uncertainty (MESU), a Bayesian framework that updates network parameters according to their uncertainty.
Our results unify ideas from metaplasticity, Bayesian inference, and Hessian-based regularization.
arXiv Detail & Related papers (2025-04-18T09:11:34Z) - SegACIL: Solving the Stability-Plasticity Dilemma in Class-Incremental Semantic Segmentation [12.315674474349956]
We propose SegACIL, a novel continual learning method for semantic segmentation based on a linear closed-form solution. Unlike traditional methods that require multiple epochs of training, SegACIL only requires a single epoch. Experiments on the Pascal VOC2012 dataset show that SegACIL achieves superior performance in the sequential, disjoint, and overlap settings.
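The summary above mentions a linear closed-form solution trained in a single epoch; a common way to realize this in analytic (closed-form) incremental learning is to fit a ridge-regression classifier on frozen features by accumulating Gram and cross-correlation statistics, so later tasks update the same solution without revisiting earlier data. The sketch below illustrates that generic recipe; it is an illustration of the idea, not SegACIL's exact formulation.

```python
import numpy as np

class AnalyticLinearClassifier:
    """Ridge-regression classifier fit in closed form from accumulated statistics."""
    def __init__(self, feat_dim, num_classes, reg=1e-3):
        self.G = reg * np.eye(feat_dim)              # regularized Gram matrix X^T X + reg*I
        self.C = np.zeros((feat_dim, num_classes))   # cross-correlation X^T Y

    def partial_fit(self, feats, onehot_labels):
        # One pass over a task's data suffices: only sufficient statistics are accumulated.
        self.G += feats.T @ feats
        self.C += feats.T @ onehot_labels

    def weights(self):
        return np.linalg.solve(self.G, self.C)       # closed-form solution W = G^{-1} C

    def predict(self, feats):
        return (feats @ self.weights()).argmax(axis=1)
```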
arXiv Detail & Related papers (2024-12-14T13:39:56Z) - Incremental Self-training for Semi-supervised Learning [56.57057576885672]
IST is simple yet effective and can be combined with existing self-training-based semi-supervised learning methods.
We verify the proposed IST on five datasets and two types of backbone, effectively improving the recognition accuracy and learning speed.
arXiv Detail & Related papers (2024-04-14T05:02:00Z) - Enhancing Consistency and Mitigating Bias: A Data Replay Approach for Incremental Learning [93.90047628101155]
Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks.
To address this, some methods propose replaying data from previous tasks during new task learning.
However, storing and replaying such data is often impractical due to memory constraints and data privacy concerns.
arXiv Detail & Related papers (2024-01-12T12:51:12Z) - Towards Robust Continual Learning with Bayesian Adaptive Moment Regularization [51.34904967046097]
Continual learning seeks to overcome the challenge of catastrophic forgetting, where a model forgets previously learnt information.
We introduce Bayesian Adaptive Moment Regularization (BAdam), a novel prior-based method that better constrains parameter growth, reducing catastrophic forgetting.
Results show that BAdam achieves state-of-the-art performance for prior-based methods on challenging single-headed class-incremental experiments.
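BAdam's exact update rule is not given in the snippet above; the sketch below only shows the generic mechanism that prior-based continual learning methods share, namely a quadratic penalty anchoring parameters to values learned on earlier tasks, weighted by a per-parameter importance estimate. The function name, the importance weights, and the usage pattern are illustrative assumptions, not BAdam's formulation.

```python
import torch

def prior_penalty(model, prior_means, importances, strength=1.0):
    """Generic prior-based regularizer: penalize drift from previously learned
    parameter values, scaled by per-parameter importance (e.g. a Fisher estimate)."""
    loss = 0.0
    for name, p in model.named_parameters():
        loss = loss + (importances[name] * (p - prior_means[name]) ** 2).sum()
    return strength * loss

# Assumed usage while training on a new task:
#   total_loss = task_loss + prior_penalty(model, prior_means, importances)
#   total_loss.backward()
```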
arXiv Detail & Related papers (2023-09-15T17:10:51Z) - CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation [128.00940554196976]
Vision-Language Continual Pretraining (VLCP) has shown impressive results on diverse downstream tasks by offline training on large-scale datasets.
To support the study of VLCP, we first contribute a comprehensive and unified benchmark dataset, P9D. Treating the data from each industry as an independent task supports continual learning, and the data follows a real-world long-tail distribution to simulate pretraining on web data.
arXiv Detail & Related papers (2023-08-14T13:53:18Z) - Large-scale Pre-trained Models are Surprisingly Strong in Incremental Novel Class Discovery [76.63807209414789]
We challenge the status quo in class-iNCD and propose a learning paradigm where class discovery occurs continuously and in a truly unsupervised manner.
We propose baselines, composed of a frozen pre-trained model (PTM) backbone and a learnable linear classifier, that are not only simple to implement but also resilient under longer learning scenarios.
arXiv Detail & Related papers (2023-03-28T13:47:16Z) - Effective and Efficient Training for Sequential Recommendation using Recency Sampling [91.02268704681124]
We propose a novel Recency-based Sampling of Sequences training objective.
We show that models enhanced with our method achieve performance exceeding or very close to that of state-of-the-art BERT4Rec.
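The exact sampling distribution used by Recency-based Sampling of Sequences is not described in the snippet above, but the core idea, drawing training targets from a user's interaction history with probabilities skewed toward recent items, can be sketched as follows; the exponential weighting is an illustrative assumption.

```python
import numpy as np

def sample_recent_targets(sequence, num_targets=1, decay=0.9, rng=np.random.default_rng()):
    """Sample target items from an interaction sequence, favoring recent positions."""
    n = len(sequence)
    # Older positions get exponentially smaller weight; the most recent item gets weight 1.
    weights = decay ** np.arange(n - 1, -1, -1)
    probs = weights / weights.sum()
    idx = rng.choice(n, size=num_targets, replace=False, p=probs)
    return [sequence[i] for i in idx]

# Example: pick two items from a short history, biased toward the end.
print(sample_recent_targets(["a", "b", "c", "d", "e"], num_targets=2))
```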
arXiv Detail & Related papers (2022-07-06T13:06:31Z) - The Challenges of Continuous Self-Supervised Learning [40.941767578622745]
Self-supervised learning (SSL) aims to eliminate one of the major bottlenecks in representation learning - the need for human annotations.
We show that a direct application of current methods to such a continuous setup is inefficient, both computationally and in the amount of data required.
We propose the use of replay buffers as an approach to alleviate the issues of inefficiency and temporal correlations.
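Replay buffers over a continuous, non-IID stream are commonly implemented with reservoir sampling, which keeps a bounded, approximately uniform sample of everything seen so far and breaks up temporal correlations when mixed into each training batch. The sketch below is a generic implementation of that idea, not the paper's exact buffer.

```python
import random

class ReservoirBuffer:
    """Bounded replay buffer using reservoir sampling over a data stream."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, item):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(item)
        else:
            j = random.randrange(self.seen)   # each seen item is kept with prob capacity/seen
            if j < self.capacity:
                self.data[j] = item

    def sample(self, batch_size):
        return random.sample(self.data, min(batch_size, len(self.data)))
```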
arXiv Detail & Related papers (2022-03-23T20:05:06Z) - An Empirical Investigation of the Role of Pre-training in Lifelong Learning [21.995593026269578]
We show that generic pre-training implicitly alleviates the effects of catastrophic forgetting when learning multiple tasks sequentially.
We study this phenomenon by analyzing the loss landscape, finding that pre-trained weights appear to ease forgetting by leading to wider minima.
arXiv Detail & Related papers (2021-12-16T19:00:55Z) - Self-Supervised Training Enhances Online Continual Learning [37.91734641808391]
In continual learning, a system must incrementally learn from a non-stationary data stream without catastrophic forgetting.
Self-supervised pre-training can yield features that generalize better than those obtained with supervised learning.
Our best system achieves a 14.95% relative increase in top-1 accuracy on class incremental ImageNet over the prior state of the art for online continual learning.
arXiv Detail & Related papers (2021-03-25T17:45:27Z) - Active Learning for Sequence Tagging with Deep Pre-trained Models and Bayesian Uncertainty Estimates [52.164757178369804]
Recent advances in transfer learning for natural language processing, in conjunction with active learning, open the possibility of significantly reducing the necessary annotation budget.
We conduct an empirical study of various Bayesian uncertainty estimation methods and Monte Carlo dropout options for deep pre-trained models in the active learning framework.
We also demonstrate that to acquire instances during active learning, a full-size Transformer can be substituted with a distilled version, which yields better computational performance.
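Monte Carlo dropout, one of the uncertainty estimators typically studied in this setting, only requires keeping dropout active at inference time and averaging several stochastic forward passes; the acquisition score can then be, for example, the predictive entropy. The sketch below assumes a generic PyTorch classifier and is meant as an illustration of the technique, not the paper's exact setup.

```python
import torch

@torch.no_grad()
def mc_dropout_entropy(model, inputs, num_passes=10):
    """Predictive entropy from Monte Carlo dropout: keep dropout enabled at inference
    and average the softmax outputs of several stochastic forward passes."""
    model.train()  # enables dropout; in a careful setup, freeze BatchNorm layers separately
    probs = torch.stack(
        [torch.softmax(model(inputs), dim=-1) for _ in range(num_passes)]
    ).mean(dim=0)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # higher = more uncertain
```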
arXiv Detail & Related papers (2021-01-20T13:59:25Z)