Branch, or Layer? Zeroth-Order Optimization for Continual Learning of Vision-Language Models
- URL: http://arxiv.org/abs/2506.12409v1
- Date: Sat, 14 Jun 2025 08:59:19 GMT
- Title: Branch, or Layer? Zeroth-Order Optimization for Continual Learning of Vision-Language Models
- Authors: Ziwei Liu, Borui Kang, Wei Li, Hangjie Yuan, Yanbing Yang, Wenbin Li, Jun Luo, Yifan Zhu, Tao Feng
- Abstract summary: This paper pioneers a systematic exploration of Zeroth-Order (ZO) optimization for vision-language continual learning (VLCL). We first identify the incompatibility of naive full-ZO adoption in VLCL due to modality-specific instability. We develop a layer-wise optimization paradigm that interleaves ZO and FO across network layers, capitalizing on the heterogeneous learning dynamics of shallow versus deep representations.
- Score: 44.27801276966812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continual learning in vision-language models (VLMs) faces critical challenges in balancing parameter efficiency, memory consumption, and optimization stability. While First-Order (FO) optimization (e.g., SGD) dominates current approaches, its deterministic gradients often trap models in suboptimal local minima and incur substantial memory overhead. This paper pioneers a systematic exploration of Zeroth-Order (ZO) optimization for vision-language continual learning (VLCL). We first identify the incompatibility of naive full-ZO adoption in VLCL due to modality-specific instability. To resolve this, we selectively apply ZO to either the vision or language modality while retaining FO in the complementary branch. Furthermore, we develop a layer-wise optimization paradigm that interleaves ZO and FO across network layers, capitalizing on the heterogeneous learning dynamics of shallow versus deep representations. A key theoretical insight reveals that ZO perturbations in vision branches exhibit higher variance than their language counterparts, prompting a gradient sign normalization mechanism with modality-specific perturbation constraints. Extensive experiments on four benchmarks demonstrate that our method achieves state-of-the-art performance, reducing memory consumption by 89.1% compared to baselines. Code will be available upon publication.
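To make the core mechanism concrete, below is a minimal sketch of SPSA-style two-point zeroth-order estimation interleaved with first-order backprop across parameter groups, including the gradient sign normalization the abstract describes. The layer split (`zo_layers`), perturbation scale `eps`, and learning rate are illustrative assumptions, not the authors' exact recipe.

```python
import torch

def zo_grad_estimate(loss_fn, params, eps=1e-3):
    """SPSA-style two-point zeroth-order gradient estimate."""
    zs = [torch.randn_like(p) for p in params]      # one random probe direction
    with torch.no_grad():
        for p, z in zip(params, zs):
            p.add_(eps * z)
        loss_plus = loss_fn()
        for p, z in zip(params, zs):
            p.sub_(2 * eps * z)
        loss_minus = loss_fn()
        for p, z in zip(params, zs):
            p.add_(eps * z)                          # restore parameters
    scale = (loss_plus - loss_minus) / (2 * eps)
    return [scale * z for z in zs]

def interleaved_step(model, loss_fn, zo_layers, lr=1e-4, eps=1e-3):
    """ZO for parameters whose names start with a prefix in `zo_layers`
    (e.g. one modality branch), ordinary backprop for everything else."""
    named = list(model.named_parameters())
    zo_params = [p for n, p in named if any(n.startswith(k) for k in zo_layers)]
    fo_params = [p for n, p in named if not any(n.startswith(k) for k in zo_layers)]
    zo_grads = zo_grad_estimate(loss_fn, zo_params, eps)  # forward passes only
    loss = loss_fn()
    loss.backward()                                  # FO gradients via backprop
    with torch.no_grad():
        for p, g in zip(zo_params, zo_grads):
            p.sub_(lr * g.sign())                    # sign normalization tames ZO variance
        for p in fo_params:
            if p.grad is not None:
                p.sub_(lr * p.grad)
    model.zero_grad()
```

In the paper's scheme the ZO side would be the vision or language tower (or alternating layers within it); here any parameter-name prefix works.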
Related papers
- SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training [13.180761892449736]
Low-rank gradient-based optimization methods have significantly improved memory efficiency during the training of large language models (LLMs). These methods primarily emphasize memory savings, often overlooking potential acceleration in convergence. In this paper, we propose SUMO (Subspace-Aware Moment-Orthogonalization), an optimizer that employs exact singular value decomposition for moment orthogonalization. We show that SUMO accelerates convergence, enhances stability, improves performance, and reduces memory requirements by up to 20% compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-05-30T16:08:40Z)
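A hedged sketch of the moment-orthogonalization step at SUMO's core: replacing the momentum matrix by its nearest orthogonal factor via an exact SVD. The subspace projection and the surrounding optimizer loop are omitted; the update rule shown in comments is an assumption.

```python
import torch

def orthogonalize_moment(m: torch.Tensor) -> torch.Tensor:
    """Nearest orthogonal factor of a momentum matrix via exact SVD:
    if m = U S V^T, return U V^T."""
    u, _, vh = torch.linalg.svd(m, full_matrices=False)
    return u @ vh

# Illustrative use for a 2-D weight w with momentum m:
#   m = beta * m + (1 - beta) * grad
#   w -= lr * orthogonalize_moment(m)
```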
- Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on the Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z)
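A rough sketch of the feature perturbation described above; the masking ratio and noise scale are assumed values, and the per-instance adaptation is reduced to independent sampling per sample in the batch.

```python
import torch

def perturb_features(feats: torch.Tensor, mask_ratio: float = 0.3,
                     noise_std: float = 0.1) -> torch.Tensor:
    """Channel masking plus Gaussian noise, sampled independently for
    each instance; feats has shape (B, C, H, W)."""
    b, c = feats.shape[:2]
    keep = (torch.rand(b, c, 1, 1, device=feats.device) > mask_ratio).float()
    return feats * keep + noise_std * torch.randn_like(feats)

# Consistency target: decoder(perturb_features(f)) should agree with decoder(f).
```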
- Understanding Generalization of Federated Learning: the Trade-off between Model Stability and Optimization [22.577751005038543]
Federated Learning (FL) is a distributed learning approach that trains machine learning models across multiple devices. FL often faces challenges due to data heterogeneity, leading to inconsistent local optima among clients. This paper introduces an innovative generalization dynamics analysis framework, named Libra, for algorithm-dependent excess risk minimization.
arXiv Detail & Related papers (2024-11-25T11:43:22Z)
- Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment [57.0121616203175]
We propose FiSAO, a novel self-alignment method that utilizes the model's own visual encoder as a fine-grained verifier to improve vision-language alignment. By leveraging token-level feedback from the vision encoder, FiSAO significantly improves vision-language alignment, even surpassing traditional preference tuning methods that require additional data.
arXiv Detail & Related papers (2024-10-18T03:34:32Z)
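One plausible reading of "token-level feedback from the vision encoder", sketched below: score each text token against pooled image features. The projection layer and cosine reward are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def token_vision_rewards(token_hidden: torch.Tensor,
                         vision_feats: torch.Tensor,
                         proj: torch.nn.Linear) -> torch.Tensor:
    """token_hidden: (T, d_text); vision_feats: (N_patches, d_vis).
    Returns a (T,) vector of per-token alignment scores."""
    v = vision_feats.mean(dim=0)          # pooled image representation
    t = proj(token_hidden)                # map text tokens into d_vis
    return F.cosine_similarity(t, v.expand_as(t), dim=-1)
```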
- A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning.
These problems are often formalized as Bi-Level optimizations (BLO).
We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth probability distribution, and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z)
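In symbols, one standard way to realize this relaxation (our reading of the abstract; the temperature tau is an assumed smoothing parameter) is a Gibbs distribution over inner solutions:

```latex
% Original bi-level problem
\min_{x} \; f\bigl(x, \theta^{*}(x)\bigr)
\quad \text{s.t.} \quad
\theta^{*}(x) \in \arg\min_{\theta} g(x, \theta)

% Stochastic relaxation: the inner loss induces a smooth distribution
% over \theta, and the outer loss becomes an expectation under it
p_{\tau}(\theta \mid x) \propto \exp\bigl(-g(x, \theta)/\tau\bigr),
\qquad
\min_{x} \; \mathbb{E}_{\theta \sim p_{\tau}(\cdot \mid x)}\bigl[f(x, \theta)\bigr]
```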
- LOBG: Less Overfitting for Better Generalization in Vision-Language Model [19.890629892640206]
We propose a framework named LOBG for vision-language models.
We use CLIP to filter out fine-grained foreground information that might cause overfitting, thereby guiding prompts with basic visual concepts.
Our method significantly improves generalization capability and alleviates overfitting compared to state-of-the-art approaches.
arXiv Detail & Related papers (2024-10-14T08:06:21Z)
- Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures [21.18741772731095]
Zeroth-order (ZO) algorithms offer a promising alternative by approximating gradients using finite differences of function values.
Existing ZO methods struggle to capture the low-rank gradient structure common in LLM fine-tuning, leading to suboptimal performance.
This paper proposes a low-rank ZO algorithm (LOZO) that effectively captures this structure in LLMs.
arXiv Detail & Related papers (2024-10-10T08:10:53Z)
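A minimal sketch of what a low-rank ZO probe can look like: the random perturbation is constrained to Z = U V^T, so the two-point estimate inherits the low-rank structure. The rank, scaling, and eps values are assumptions, not LOZO's exact settings.

```python
import torch

def lowrank_zo_grad(loss_fn, weight: torch.Tensor, rank: int = 4,
                    eps: float = 1e-3) -> torch.Tensor:
    """Two-point ZO estimate for a 2-D weight with a low-rank probe
    Z = U V^T, so the gradient estimate is rank-`rank` by construction."""
    m, n = weight.shape
    u = torch.randn(m, rank, device=weight.device)
    v = torch.randn(n, rank, device=weight.device)
    z = (u @ v.T) / rank ** 0.5               # low-rank random direction
    with torch.no_grad():
        weight.add_(eps * z)
        loss_plus = loss_fn()
        weight.sub_(2 * eps * z)
        loss_minus = loss_fn()
        weight.add_(eps * z)                  # restore the weight
    return ((loss_plus - loss_minus) / (2 * eps)) * z
```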
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain the stability in terms of zero-shot generalization of VLMs, dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the model in few-shot image classification scenarios.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
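Orthogonal fine-tuning is typically implemented by rotating frozen pretrained weights with a learned orthogonal matrix; the Cayley parametrization below is a common construction, offered as a sketch rather than OrthSR's exact formulation.

```python
import torch

class OrthogonalAdapter(torch.nn.Module):
    """Rotate a frozen pretrained weight W by a learned orthogonal R,
    parametrized via the Cayley transform R = (I + A)^{-1}(I - A) with
    A skew-symmetric, so R^T R = I holds by construction."""
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        d = weight.shape[0]
        self.register_buffer("weight", weight)          # frozen pretrained W
        self.s = torch.nn.Parameter(torch.zeros(d, d))  # trainable seed for A

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.s - self.s.T                       # skew-symmetric A
        eye = torch.eye(a.shape[0], device=a.device)
        r = torch.linalg.solve(eye + a, eye - a)    # (I + A)^{-1}(I - A)
        return x @ (r @ self.weight).T
```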
- Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark [166.40879020706151]
This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during fine-tuning.
Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques.
Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance.
arXiv Detail & Related papers (2024-02-18T14:08:48Z)
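The forward gradient method highlighted above replaces finite differences with an exact directional derivative from one forward-mode pass; a minimal sketch with torch.func (the functional wrapping is an assumption about usage, not the benchmark's code):

```python
import torch
from torch.func import functional_call, jvp

def forward_gradient(model, params: dict, loss_fn, batch):
    """Forward-gradient estimate g = (grad_f . v) v: one forward-mode JVP
    along a random direction v; no backward graph is ever built."""
    v = {k: torch.randn_like(p) for k, p in params.items()}
    def f(p):
        return loss_fn(functional_call(model, p, (batch,)))
    _, dir_deriv = jvp(f, (params,), (v,))   # scalar directional derivative
    return {k: dir_deriv * v[k] for k in params}

# params = dict(model.named_parameters()); grads = forward_gradient(...)
```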
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
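The memory saving in this family of optimizers comes from fusing the update into the backward pass, so the full set of gradients never coexists in memory. A simplified sketch using PyTorch's post-accumulate hooks; the scalar second moment here stands in for AdaLomo's factored (Adafactor-style) statistics.

```python
import torch

def attach_fused_update(model, lr=1e-4, beta2=0.99, eps=1e-8):
    """Update each parameter as soon as its gradient is accumulated,
    then free the gradient, so full gradients are never stored."""
    state = {}
    def make_hook(name):
        def hook(p):
            v = state.get(name, torch.zeros((), device=p.device))
            v = beta2 * v + (1 - beta2) * p.grad.pow(2).mean()
            state[name] = v                   # running second-moment estimate
            with torch.no_grad():
                p.sub_(lr * p.grad / (v.sqrt() + eps))
            p.grad = None                     # release gradient memory now
        return hook
    for name, p in model.named_parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(make_hook(name))
```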
- SING: A Plug-and-Play DNN Learning Technique [25.563053353709627]
We propose SING (StabIlized and Normalized Gradient), a plug-and-play technique that improves the stability and robustness of Adam(W).
SING is straightforward to implement and has minimal computational overhead, requiring only a layer-wise standardization of the gradients fed to Adam(W).
arXiv Detail & Related papers (2023-05-25T12:39:45Z)
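Since the recipe is one line of gradient surgery, a sketch is nearly the whole method: standardize each parameter tensor's gradient to zero mean and unit variance before the Adam(W) step (eps is an assumed numerical-stability constant).

```python
import torch

def standardize_gradients(model, eps: float = 1e-8):
    """Layer-wise gradient standardization: zero mean, unit variance per
    parameter tensor. Call after loss.backward(), before optimizer.step()."""
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None or p.grad.numel() < 2:
                continue                      # skip scalars (std undefined)
            g = p.grad
            p.grad = (g - g.mean()) / (g.std() + eps)

# loss.backward(); standardize_gradients(model); adamw.step()
```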