More Than Memory Savings: Zeroth-Order Optimization Mitigates Forgetting in Continual Learning
- URL: http://arxiv.org/abs/2510.21019v1
- Date: Thu, 23 Oct 2025 21:54:00 GMT
- Title: More Than Memory Savings: Zeroth-Order Optimization Mitigates Forgetting in Continual Learning
- Authors: Wanhao Yu, Zheng Wang, Shuteng Niu, Sen Lin, Li Yang
- Abstract summary: Zeroth-order (ZO) optimization has gained attention as a memory-efficient alternative to first-order (FO) methods. We show that ZO optimization naturally leads to flatter loss landscapes, which in turn reduce forgetting in continual learning. This stability comes at the cost of plasticity: due to its imprecise gradient estimates and slower convergence, ZO optimization tends to be less effective than FO in acquiring new task-specific knowledge. We propose ZO-FC, a simple but effective approach that applies ZO optimization to a single adapter-based PEFT module with an FO-optimized classifier.
- Score: 10.698225972251839
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zeroth-order (ZO) optimization has gained attention as a memory-efficient alternative to first-order (FO) methods, particularly in settings where gradient computation is expensive or even impractical. Beyond its memory efficiency, in this work, we investigate ZO optimization for continual learning (CL) as a novel approach to address the plasticity-stability-efficiency trilemma. Through theoretical analysis and empirical evidence, we show that ZO optimization naturally leads to flatter loss landscapes, which in turn reduce forgetting in CL. However, this stability comes at the cost of plasticity: due to its imprecise gradient estimates and slower convergence, ZO optimization tends to be less effective than FO in acquiring new task-specific knowledge, particularly under constrained training budgets. To better understand this trade-off, we conduct a holistic evaluation of ZO optimization applied to various existing CL methods. Our findings reveal that ZO optimization enhances stability but often undermines plasticity, particularly when used with learnable classifiers. Motivated by this insight, we propose ZO-FC, a simple but effective approach that applies ZO optimization to a single adapter-based PEFT module with an FO-optimized classifier. This design leverages the stability benefits of ZO while preserving the adaptability of FO updates with negligible memory overhead. Experiments demonstrate that ZO-FC achieves an effective balance between stability and plasticity, offering a practical and memory-efficient solution for on-device CL.
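To make the ZO-FC recipe concrete, here is a minimal sketch, assuming a MeZO-style two-point (SPSA) gradient estimator for the adapter parameters and a plain first-order SGD step for the classifier head; the function names, arguments, and hyperparameters are illustrative assumptions, not code from the paper.

```python
import torch

def zo_fc_step(model, adapter_params, classifier, batch, loss_fn,
               mu=1e-3, zo_lr=1e-4, fo_lr=1e-3):
    """One hypothetical ZO-FC update: a zeroth-order (SPSA-style) step on the
    adapter parameters and a first-order (backprop) step on the classifier."""
    x, y = batch

    # --- ZO update of the adapter: two forward passes, no backprop ---
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        # Regenerate the same Gaussian direction z from the shared seed,
        # so the perturbation can be applied and undone without storing z.
        torch.manual_seed(seed)
        for p in adapter_params:
            p.add_(scale * mu * torch.randn_like(p))

    with torch.no_grad():
        perturb(+1.0)                                    # theta + mu * z
        loss_plus = loss_fn(model(x), y).item()
        perturb(-2.0)                                    # theta - mu * z
        loss_minus = loss_fn(model(x), y).item()
        perturb(+1.0)                                    # restore theta
        grad_scale = (loss_plus - loss_minus) / (2.0 * mu)

        torch.manual_seed(seed)
        for p in adapter_params:
            z = torch.randn_like(p)
            p.add_(-zo_lr * grad_scale * z)              # ZO-SGD step

    # --- FO update of the classifier head only ---
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, list(classifier.parameters()))
    with torch.no_grad():
        for p, g in zip(classifier.parameters(), grads):
            p.add_(-fo_lr * g)
    return loss.item()
```

In this sketch the adapter would be a small PEFT module (e.g., a LoRA block), so the only backward pass touches the lightweight classifier head; the adapter is updated from loss differences alone, which is what keeps the memory overhead negligible.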
Related papers
- Stabilizing Policy Optimization via Logits Convexity [59.242732612484474]
We show that the convexity of the supervised fine-tuning loss with respect to model logits plays a key role in enabling stable training. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework.
arXiv Detail & Related papers (2026-03-01T07:40:12Z) - Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models [52.48582333951919]
We propose a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates. SAGE (Stability-Aware Gradient Efficiency) integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence. Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines.
arXiv Detail & Related papers (2026-02-01T12:56:10Z) - C-Flat++: Towards a More Efficient and Powerful Framework for Continual Learning [26.486835539215523]
We propose Continual Flatness (C-Flat), a method that promotes flatter loss landscapes tailored for continual learning. C-Flat offers plug-and-play compatibility, enabling easy integration with minimal modifications to the code pipeline. In addition, we introduce C-Flat++, an efficient yet effective framework that leverages selective flatness-driven promotion.
arXiv Detail & Related papers (2025-08-26T09:39:09Z) - Optimizers Qualitatively Alter Solutions And We Should Leverage This [62.662640460717476]
Optimizers that use only local information, such as SGD, cannot guarantee convergence of Deep Neural Networks (DNNs) to a unique global minimum of the loss. We argue that the community should aim at understanding the biases of existing methods, and at building new optimizers with the explicit intent of inducing certain properties of the solution.
arXiv Detail & Related papers (2025-07-16T13:33:31Z) - Memory-Efficient Personalization of Text-to-Image Diffusion Models via Selective Optimization Strategies [20.358557194892484]
We propose a selective optimization framework that adaptively chooses between backpropagation on low-resolution images (BP-low) and zeroth-order optimization on high-resolution images (ZO-high). Our method achieves competitive performance while significantly reducing memory consumption, enabling scalable, high-quality on-device personalization without increasing inference latency.
arXiv Detail & Related papers (2025-07-14T08:08:55Z) - SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training [13.180761892449736]
Low-rank gradient-based optimization methods have significantly improved memory efficiency during the training of large language models (LLMs). These methods primarily emphasize memory savings, often overlooking potential acceleration in convergence. In this paper, we propose SUMO (Subspace-Aware Moment-Orthogonalization), an optimizer that employs exact singular value decomposition. We show that SUMO accelerates convergence, enhances stability, improves performance, and reduces memory requirements by up to 20% compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-05-30T16:08:40Z) - KerZOO: Kernel Function Informed Zeroth-Order Optimization for Accurate and Accelerated LLM Fine-Tuning [15.81250204481401]
We introduce a kernel-function-based ZO framework aimed at mitigating gradient estimation bias. KerZOO achieves comparable or superior performance to existing ZO baselines. We show that the kernel function is an effective avenue for reducing estimation bias in ZO methods.
arXiv Detail & Related papers (2025-05-24T21:56:03Z) - AYLA: Amplifying Gradient Sensitivity via Loss Transformation in Non-Convex Optimization [0.0]
Stochastic Gradient Descent (SGD) and its variants, such as ADAM, are foundational to deep learning optimization. This paper introduces AYLA, a novel framework that enhances training dynamics through loss transformation.
arXiv Detail & Related papers (2025-04-02T16:31:39Z) - COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs [77.79640601822341]
Large Language Models (LLMs) have demonstrated remarkable success across various domains. Their optimization remains a significant challenge due to the complex and high-dimensional loss landscapes they inhabit.
arXiv Detail & Related papers (2025-02-24T18:42:19Z) - A Novel Unified Parametric Assumption for Nonconvex Optimization [53.943470475510196]
Nonconvex optimization is central to machine learning, but the generality of nonconvexity permits only weak convergence guarantees that are often too pessimistic in practice. We introduce a novel unified parametric assumption for nonconvex optimization algorithms.
arXiv Detail & Related papers (2025-02-17T21:25:31Z) - Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension [16.037614012166063]
This paper takes a step towards the systematic design of efficient optimizers through the lens of the Fisher information matrix (FIM). We show that many state-of-the-art efficient optimizers can be viewed as solutions to FIM approximation (under the Frobenius norm) with specific structural assumptions. We propose two design recommendations for practical, efficient optimizers for LLMs, involving careful selection of structural assumptions to balance generality and efficiency.
arXiv Detail & Related papers (2025-02-11T18:27:19Z) - Constrain Alignment with Sparse Autoencoders [45.131670081186]
Feature-level constrained Preference Optimization is a novel method designed to simplify the alignment process while ensuring stability. Our approach enjoys efficiency by using sparse features activated in a well-trained sparse autoencoder and the quality of sequential KL divergence.
arXiv Detail & Related papers (2024-11-12T07:54:13Z) - Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark [166.40879020706151]
This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during fine-tuning.
Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques.
Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance (a minimal forward-gradient sketch appears after this list).
arXiv Detail & Related papers (2024-02-18T14:08:48Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
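For the forward gradient method mentioned in the benchmark entry above, here is a minimal sketch using PyTorch's torch.func API; the parameter handling, names, and hyperparameters are illustrative assumptions, not the benchmark's code.

```python
import torch
from torch.func import functional_call, jvp

def forward_gradient_step(model, params, batch, loss_fn, lr=1e-4):
    """Hypothetical forward-gradient update: one Jacobian-vector product along
    a random direction v yields the estimate g_hat = (grad(L) . v) * v,
    computed with forward-mode AD and no backward pass."""
    x, y = batch
    names = list(params.keys())
    v = {k: torch.randn_like(p) for k, p in params.items()}   # random tangent

    def loss_of(*flat_params):
        # Rebuild the parameter dict and evaluate the loss functionally.
        p = dict(zip(names, flat_params))
        return loss_fn(functional_call(model, p, (x,)), y)

    # Directional derivative of the loss along v (a scalar).
    _, dir_deriv = jvp(loss_of,
                       tuple(params[k] for k in names),
                       tuple(v[k] for k in names))

    with torch.no_grad():
        for k in names:
            params[k].sub_(lr * dir_deriv * v[k])              # SGD-style step
    return float(dir_deriv)
```

Unlike the two-point SPSA estimator sketched earlier, forward-mode AD gives an exact directional derivative, so the only noise in the estimate comes from the choice of v; like SPSA, it requires no backward pass.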
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.