Evolutionary Strategies lead to Catastrophic Forgetting in LLMs
- URL: http://arxiv.org/abs/2601.20861v1
- Date: Wed, 28 Jan 2026 18:59:34 GMT
- Title: Evolutionary Strategies lead to Catastrophic Forgetting in LLMs
- Authors: Immanuel Abdi, Akshat Gupta, Micah Mok, Alexander Lu, Nicholas Lee, Gopala Anumanchipalli
- Abstract summary: Evolutionary Strategies (ES) have recently re-emerged as a gradient-free alternative to traditional learning algorithms. ES reaches performance close to GRPO on math and reasoning tasks with a comparable compute budget, but these gains are accompanied by significant forgetting of prior abilities, limiting its applicability for training models online.
- Score: 51.91763220981834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the biggest missing capabilities in current AI systems is the ability to learn continuously after deployment. Implementing such continual learning systems presents several challenges, one of which is the large memory requirement of the gradient-based algorithms used to train state-of-the-art LLMs. Evolutionary Strategies (ES) have recently re-emerged as a gradient-free alternative to traditional learning algorithms and have shown encouraging performance on specific LLM tasks. In this paper, we perform a comprehensive analysis of ES and specifically evaluate its forgetting curves when training for an increasing number of update steps. We first find that ES is able to reach performance close to that of GRPO for math and reasoning tasks with a comparable compute budget. However, and most importantly for continual learning, the performance gains of ES are accompanied by significant forgetting of prior abilities, limiting its applicability for training models online. We also explore the reason behind this behavior and show that the updates made using ES are much less sparse and have orders of magnitude larger $\ell_2$ norm than the corresponding GRPO updates, explaining the contrasting forgetting curves between the two algorithms. With this study, we aim to highlight the issue of forgetting in gradient-free algorithms like ES and hope to inspire future work to mitigate these issues.
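The abstract's key comparison, that ES updates are denser and far larger in $\ell_2$ norm than GRPO updates, is easiest to see on a toy example. Below is a minimal sketch, not the paper's implementation, of an OpenAI-style ES step on a plain NumPy parameter vector with a hypothetical reward function, together with the two update statistics the abstract contrasts; the function names, hyperparameters, and sparsity tolerance are illustrative assumptions.

```python
import numpy as np

def es_update(theta, reward_fn, n_perturbations=64, sigma=0.02, lr=0.01, rng=None):
    """One gradient-free ES step (OpenAI-ES style): sample Gaussian perturbations,
    score each one, and move theta along the reward-weighted noise direction."""
    rng = rng if rng is not None else np.random.default_rng(0)
    noise = rng.standard_normal((n_perturbations, theta.size))        # (n, d)
    rewards = np.array([reward_fn(theta + sigma * eps) for eps in noise])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # scale-invariant
    update = (lr / (n_perturbations * sigma)) * (noise.T @ advantages)
    return theta + update, update

def update_stats(update, tol=1e-6):
    """Sparsity (fraction of near-zero entries) and l2 norm of an update vector --
    the two statistics the abstract contrasts between ES and GRPO updates."""
    return {"sparsity": float(np.mean(np.abs(update) < tol)),
            "l2_norm": float(np.linalg.norm(update))}

# Toy usage: maximize reward = -||theta||^2 on a 1,000-dimensional parameter vector.
theta = np.random.default_rng(1).standard_normal(1000)
theta, delta = es_update(theta, reward_fn=lambda p: -float(np.sum(p**2)))
print(update_stats(delta))
```

Because the dense Gaussian noise touches every coordinate, an update of this form is typically far less sparse than a gradient-based one, which is consistent with the abstract's observation; the paper itself measures these statistics on actual LLM parameter updates rather than on a toy vector.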
Related papers
- Sequencing to Mitigate Catastrophic Forgetting in Continual Learning [1.1724961392643483]
Catastrophic forgetting (CF) is a major challenge to the progress of Continual Learning approaches. We consider the role of task sequencing in mitigating CF and propose a method for determining the optimal task order. Results demonstrate that intelligent task sequencing can substantially reduce CF.
arXiv Detail & Related papers (2025-12-18T18:40:58Z) - EA4LLM: A Gradient-Free Approach to Large Language Model Optimization via Evolutionary Algorithms [23.009274904878065]
We propose EA4LLM, an evolutionary algorithm for optimizing large language models (LLMs). We empirically verify full-parameter optimization from the pretraining stage across model sizes ranging from 0.5B to 32B. Our work challenges the prevailing assumption that gradient-based optimization is the only viable approach for training neural networks.
arXiv Detail & Related papers (2025-10-12T13:38:28Z) - Prospective Learning in Retrospect [24.17160211422211]
The Probably Approximately Correct (PAC) learning framework fails to account for dynamic data distributions and evolving objectives. We present preliminary results that improve the algorithm and numerical results, and extend prospective learning to sequential decision-making scenarios.
arXiv Detail & Related papers (2025-07-10T17:45:15Z) - LLM Post-Training: A Deep Dive into Reasoning Large Language Models [131.10969986056]
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations.
arXiv Detail & Related papers (2025-02-28T18:59:54Z) - Transfer Learning with Foundational Models for Time Series Forecasting using Low-Rank Adaptations [0.0]
This study proposes the methodology LLIAM, a straightforward adaptation of a kind of foundation model, Large Language Models, for the time series forecasting task. A comparison was made between the performance of LLIAM and different state-of-the-art DL algorithms, including Recurrent Neural Networks and Temporal Convolutional Networks, as well as an LLM-based method, TimeLLM. The outcomes of this investigation demonstrate the efficacy of LLIAM, highlighting that this straightforward and general approach can attain competent results without the need for complex modifications.
arXiv Detail & Related papers (2024-10-15T12:14:01Z) - Landscape-Aware Growing: The Power of a Little LAG [49.897766925371485]
We study the question of how to select the best growing strategy from a given pool of growing strategies.
We present an alternative perspective based on early training dynamics, which we call "landscape-aware growing (LAG)".
arXiv Detail & Related papers (2024-06-04T16:38:57Z) - Clustering-based Domain-Incremental Learning [4.835091081509403]
A key challenge in continual learning is the so-called "catastrophic forgetting problem".
We propose an online clustering-based approach on a dynamically updated finite pool of samples or gradients.
We demonstrate the effectiveness of the proposed strategy and its promising performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-09-21T13:49:05Z) - Efficient Feature Transformations for Discriminative and Generative Continual Learning [98.10425163678082]
We propose a simple task-specific feature map transformation strategy for continual learning.
These provide powerful flexibility for learning new tasks, achieved with minimal parameters added to the base architecture.
We demonstrate the efficacy and efficiency of our method with an extensive set of experiments in discriminative (CIFAR-100 and ImageNet-1K) and generative sequences of tasks.
arXiv Detail & Related papers (2021-03-25T01:48:14Z) - Reparameterized Variational Divergence Minimization for Stable Imitation [57.06909373038396]
We study the extent to which variations in the choice of probabilistic divergence may yield more performant ILO algorithms.
We contribute a reparameterization trick for adversarial imitation learning to alleviate the challenges of the promising $f$-divergence minimization framework.
Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks.
arXiv Detail & Related papers (2020-06-18T19:04:09Z) - AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z)