SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes
- URL: http://arxiv.org/abs/2506.20990v1
- Date: Thu, 26 Jun 2025 04:07:14 GMT
- Title: SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes
- Authors: Yifan Yang, Zhen Zhang, Rupak Vignesh Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang
- Abstract summary: Fine-tuning vision language models (VLMs) has achieved remarkable performance across various downstream tasks, but it requires access to model gradients through backpropagation (BP), making such methods unsuitable for memory-constrained, inference-only edge devices. We propose a hybrid Sharpness-aware Zeroth-order optimization (SharpZO) approach, specifically designed to enhance the performance of ZO VLM fine-tuning.
- Score: 18.727093839777755
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning vision language models (VLMs) has achieved remarkable performance across various downstream tasks; yet, it requires access to model gradients through backpropagation (BP), making it unsuitable for memory-constrained, inference-only edge devices. To address this limitation, previous work has explored various BP-free fine-tuning methods. However, these approaches often rely on high-variance evolutionary strategies (ES) or zeroth-order (ZO) optimization and frequently fail to achieve satisfactory performance. In this paper, we propose a hybrid Sharpness-aware Zeroth-order optimization (SharpZO) approach, specifically designed to enhance the performance of ZO VLM fine-tuning via sharpness-aware warm-up training. SharpZO features a two-stage optimization process: a sharpness-aware ES stage that globally explores and smooths the loss landscape to construct a strong initialization, followed by a fine-grained local search via sparse ZO optimization. The entire optimization relies solely on forward passes. Detailed theoretical analysis and extensive experiments on CLIP models demonstrate that SharpZO significantly improves accuracy and convergence speed, achieving up to a 7% average gain over state-of-the-art forward-only methods.
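A minimal sketch of the two-stage, forward-only recipe described above, run on a toy quadratic loss; the sharpness penalty and the random sparse-coordinate rule are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def loss(p):
    """Stand-in for a forward-only loss evaluation, e.g. a CLIP
    prompt-tuning loss on a few-shot batch (assumed interface)."""
    return float(np.sum(p ** 2))  # toy quadratic for illustration

def sharpness_aware_es(p, iters=100, pop=16, sigma=0.1, rho=0.05, lr=0.05):
    """Stage 1 (sketch): evolutionary search whose fitness penalizes sharp
    minima by also scoring each candidate at a nearby perturbed point."""
    for _ in range(iters):
        eps = np.random.randn(pop, p.size)
        fit = np.array([
            loss(p + sigma * e) + loss(p + sigma * e + rho * np.random.randn(p.size))
            for e in eps
        ])
        fit = (fit - fit.mean()) / (fit.std() + 1e-8)      # normalize fitness
        p = p - lr / (pop * sigma) * (fit @ eps)           # ES gradient estimate
    return p

def sparse_zo(p, iters=500, mu=1e-3, lr=1e-2, k=16):
    """Stage 2 (sketch): SPSA-style zeroth-order descent restricted to a
    random sparse mask of k coordinates per step."""
    for _ in range(iters):
        mask = np.zeros_like(p)
        mask[np.random.choice(p.size, k, replace=False)] = 1.0
        u = np.random.randn(p.size) * mask
        g = (loss(p + mu * u) - loss(p - mu * u)) / (2 * mu)  # two forward passes
        p = p - lr * g * u
    return p

prompt = np.random.randn(64)          # e.g. flattened soft-prompt embeddings
prompt = sharpness_aware_es(prompt)   # global, smoothed warm-up
prompt = sparse_zo(prompt)            # fine-grained local search
```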
Related papers
- Divergence Minimization Preference Optimization for Diffusion Model Alignment [58.651951388346525]
Divergence Minimization Preference Optimization (DMPO) is a principled method for aligning diffusion models by minimizing reverse KL divergence. Our results show that diffusion models fine-tuned with DMPO can consistently outperform or match existing techniques. DMPO unlocks a robust and elegant pathway for preference alignment, bridging principled theory with practical performance in diffusion models.
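A hedged sketch of the reverse-KL idea: the reward-tilted target below is the standard construction p*(x) ∝ p_ref(x)·exp(r(x)/β), assumed here rather than taken from the paper, and a practical implementation would need reparameterized sampling for the gradient to flow:

```python
import torch

def reverse_kl_alignment_loss(logq, logp_ref, reward, beta=0.1):
    # samples x ~ q_theta; all tensors share shape (batch,)
    log_target = logp_ref + reward / beta   # unnormalized reward-tilted target
    return (logq - log_target).mean()       # E_q[log q - log p*], up to a constant

logq = torch.randn(8, requires_grad=True)   # log q_theta(x) at sampled x
loss = reverse_kl_alignment_loss(logq, torch.randn(8), torch.rand(8))
loss.backward()
```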
arXiv Detail & Related papers (2025-07-10T07:57:30Z) - Two-Stage Random Alternation Framework for One-Shot Pansharpening [12.385955231193675]
We introduce a two-stage random alternating framework (TRA-PAN) that performs instance-specific optimization for any given multispectral (MS)/panchromatic (PAN) pair. TRA-PAN effectively integrates strong supervision constraints from reduced-resolution images with the physical characteristics of the full-resolution images. Experimental results demonstrate that TRA-PAN outperforms state-of-the-art (SOTA) methods in quantitative metrics and visual quality in real-world scenarios.
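A sketch of what such random alternation could look like; `net`, `downsample`, and `to_pan` are hypothetical stand-ins, and the loss choices are assumptions rather than TRA-PAN's exact objectives:

```python
import random
import torch
import torch.nn.functional as F

def tra_step(net, ms, pan, opt, downsample, to_pan):
    """One randomly alternated optimization step on a single MS/PAN pair."""
    opt.zero_grad()
    if random.random() < 0.5:
        # reduced-resolution branch: Wald-protocol-style supervision,
        # where the original MS image serves as the reference
        out = net(downsample(ms), downsample(pan))
        loss = F.l1_loss(out, ms)
    else:
        # full-resolution branch: physical consistency with the inputs
        out = net(ms, pan)
        loss = F.l1_loss(downsample(out), ms) + F.l1_loss(to_pan(out), pan)
    loss.backward()
    opt.step()
```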
arXiv Detail & Related papers (2025-05-10T09:26:22Z) - Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning [37.507489928116804]
Large language models (LLMs) excel across various tasks, but standard first-order (FO) fine-tuning demands considerable memory. We introduce a novel layer-wise divergence analysis that uncovers the distinct update patterns of FO and ZO optimization. We propose Divergence-driven Zeroth-Order (DiZO) optimization.
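For background, a minimal MeZO-style zeroth-order step of the kind DiZO builds on: two forward passes with a shared random seed estimate a directional derivative, and the per-layer scale `gamma` stands in for a divergence-driven layer-wise adaptation (the scaling rule itself is an assumption):

```python
import torch

@torch.no_grad()
def zo_step(model, loss_fn, batch, gamma, mu=1e-3, lr=1e-6, seed=0):
    torch.manual_seed(seed)
    for p in model.parameters():
        p.add_(mu * torch.randn_like(p))           # theta + mu*z
    loss_plus = loss_fn(model, batch)
    torch.manual_seed(seed)
    for p in model.parameters():
        p.add_(-2 * mu * torch.randn_like(p))      # theta - mu*z (same z via seed)
    loss_minus = loss_fn(model, batch)
    g = (loss_plus - loss_minus) / (2 * mu)        # projected gradient (a scalar)
    torch.manual_seed(seed)
    for name, p in model.named_parameters():
        z = torch.randn_like(p)
        p.add_(mu * z)                             # restore theta
        p.add_(-lr * gamma.get(name, 1.0) * g * z) # layer-wise scaled ZO update
```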
arXiv Detail & Related papers (2025-02-05T16:03:17Z) - PROFIT: A Specialized Optimizer for Deep Fine Tuning [9.082267858686933]
We present PROFIT (Prolly Restricted For Iterative Training), one of the first optimizers specifically designed for incrementally fine-tuning converged models on new tasks or datasets. By employing a simple temporalization process, PROFIT outperforms traditional fine-tuning methods across various tasks. PROFIT is encapsulated within the optimizer logic, making it easy to integrate into any training pipeline with minimal engineering effort.
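The summary is terse, so the following is only a guess at the flavor of such a restricted optimizer: a plain SGD step plus a proximal pull toward the converged reference weights. This is not the paper's algorithm, just a hedged illustration:

```python
import torch

@torch.no_grad()
def restricted_step(params, ref_params, lr=1e-3, lam=0.1):
    # SGD step plus a proximal pull toward the converged reference weights,
    # restricting how far iterative fine-tuning can drift (illustrative only)
    for p, p_ref in zip(params, ref_params):
        if p.grad is not None:
            p.add_(-lr * (p.grad + lam * (p - p_ref)))
```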
arXiv Detail & Related papers (2024-12-02T19:37:34Z) - Aligning Few-Step Diffusion Models with Dense Reward Difference Learning [81.85515625591884]
Stepwise Diffusion Policy Optimization (SDPO) is an alignment method tailored for few-step diffusion models.
SDPO incorporates dense reward feedback at every intermediate step to ensure consistent alignment across all denoising steps.
SDPO consistently outperforms prior methods in reward-based alignment across diverse step configurations.
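A sketch of dense stepwise feedback: each intermediate denoising step gets its own reward and a REINFORCE-style weight on that step's log-probability; the per-step `logprobs`/`rewards` interface is an assumption, not SDPO's exact loss:

```python
import torch

def stepwise_alignment_loss(logprobs, rewards):
    # logprobs, rewards: (batch, num_steps), one entry per denoising step
    adv = rewards - rewards.mean(dim=0, keepdim=True)   # simple per-step baseline
    return -(adv.detach() * logprobs).sum(dim=1).mean() # weight every step

lp = torch.randn(4, 8, requires_grad=True)
stepwise_alignment_loss(lp, torch.rand(4, 8)).backward()
```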
arXiv Detail & Related papers (2024-11-18T16:57:41Z) - Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method that efficiently adapts pretrained weights while enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain stability in terms of the zero-shot generalization of VLMs; the method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the models in the few-shot image classification scenario.
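A sketch of the two ingredients under stated assumptions: an orthogonal rotation of a frozen square weight via the Cayley transform, and a KL pull toward the frozen zero-shot model's logits; the details are illustrative, not OrthSR's exact formulation:

```python
import torch
import torch.nn.functional as F

class OrthogonalLinear(torch.nn.Module):
    """Rotates a frozen square weight by an orthogonal matrix Q built from
    a skew-symmetric parameter via the Cayley transform; at init Q = I, so
    training starts exactly at the pretrained model."""
    def __init__(self, frozen_weight):                  # frozen_weight: (d, d)
        super().__init__()
        d = frozen_weight.shape[0]
        self.register_buffer("w0", frozen_weight)
        self.a = torch.nn.Parameter(torch.zeros(d, d))  # skew-symmetric seed

    def forward(self, x):
        s = self.a - self.a.T                           # skew-symmetric
        eye = torch.eye(s.shape[0], device=s.device)
        q = torch.linalg.solve(eye + s, eye - s)        # Cayley: Q is orthogonal
        return x @ (q @ self.w0).T

def self_regularization(logits_ft, logits_zs, tau=1.0):
    # KL pull of fine-tuned logits toward the frozen zero-shot model's logits
    return F.kl_div(F.log_softmax(logits_ft / tau, dim=-1),
                    F.softmax(logits_zs / tau, dim=-1), reduction="batchmean")
```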
arXiv Detail & Related papers (2024-07-11T10:35:53Z) - Zero-Shot Sharpness-Aware Quantization for Pre-trained Language Models [88.80146574509195]
Quantization is a promising approach for reducing memory overhead and accelerating inference.
We propose a novel Zero-shot Sharpness-Aware Quantization (ZSAQ) framework for the zero-shot quantization of various PLMs.
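For context, the usual fake-quantization building block such methods optimize, with a straight-through estimator; the sharpness-aware minimax objective of ZSAQ itself is not reproduced here:

```python
import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, scale, bits):
        # uniform symmetric fake-quantization: round, clamp, rescale
        qmax = 2 ** (bits - 1) - 1
        return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None, None          # straight-through estimator

w = torch.randn(4, 4, requires_grad=True)
w_q = FakeQuant.apply(w, torch.tensor(0.05), 8)
w_q.sum().backward()                         # gradients flow straight through
```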
arXiv Detail & Related papers (2023-10-20T07:09:56Z) - Gradient constrained sharpness-aware prompt learning for vision-language models [99.74832984957025]
This paper targets a novel trade-off problem in generalizable prompt learning for vision-language models (VLMs).
By analyzing the loss landscapes of the state-of-the-art method and vanilla Sharpness-aware Minimization (SAM) based method, we conclude that the trade-off performance correlates to both loss value and loss sharpness.
We propose a novel SAM-based method for prompt learning, denoted as Gradient Constrained Sharpness-aware Context Optimization (GCSCoOp).
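A minimal SAM update on soft-prompt context vectors, the core mechanism SAM-based prompt learning builds on; GCSCoOp's gradient constraint is not reproduced here:

```python
import torch

def sam_prompt_step(ctx, loss_fn, lr=2e-3, rho=0.05):
    loss_fn(ctx).backward()                              # gradient at ctx
    eps = rho * ctx.grad / (ctx.grad.norm() + 1e-12)     # ascent direction
    ctx_adv = (ctx.detach() + eps).requires_grad_(True)  # worst-case neighbor
    loss_fn(ctx_adv).backward()                          # gradient at neighbor
    with torch.no_grad():
        ctx -= lr * ctx_adv.grad                         # SAM descent step
    ctx.grad = None
    return ctx

# e.g. ctx = torch.randn(4, 512, requires_grad=True)  # CoOp-style context tokens
```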
arXiv Detail & Related papers (2023-09-14T17:13:54Z) - Break a Lag: Triple Exponential Moving Average for Enhanced Optimization [2.0199251985015434]
We introduce Fast Adaptive Moment Estimation (FAME), a novel optimization technique that leverages the power of the Triple Exponential Moving Average. FAME enhances responsiveness to data dynamics, mitigates trend-identification lag, and optimizes learning efficiency. Our comprehensive evaluation encompasses different computer vision tasks including image classification, object detection, and semantic segmentation, integrating FAME into 30 distinct architectures.
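The triple exponential moving average the summary refers to is a standard lag-reduction construction, TEMA = 3·EMA₁ − 3·EMA₂ + EMA₃ over cascaded EMAs; how FAME wires it into optimizer moment estimates is not reproduced here:

```python
def tema_stream(beta=0.9):
    """Triple exponential moving average: three cascaded EMAs combined as
    3*e1 - 3*e2 + e3 to cancel most of a single EMA's lag."""
    e1 = e2 = e3 = None
    def update(x):
        nonlocal e1, e2, e3
        e1 = x if e1 is None else beta * e1 + (1 - beta) * x
        e2 = e1 if e2 is None else beta * e2 + (1 - beta) * e1
        e3 = e2 if e3 is None else beta * e3 + (1 - beta) * e2
        return 3 * e1 - 3 * e2 + e3
    return update

smooth = tema_stream()
for t in range(5):
    print(smooth(float(t)))   # tracks the ramp with far less lag than one EMA
```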
arXiv Detail & Related papers (2023-06-02T10:29:33Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of extrapolation variants can be covered by a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
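One well-known member of the extrapolation family such a framework covers is the extragradient step, sketched here on a toy quadratic; `grad_fn` is an assumed helper returning gradients at a given point:

```python
import torch

def extragradient_step(params, grad_fn, lr=0.1):
    g = grad_fn(params)                                 # gradient at current point
    half = [p - lr * gi for p, gi in zip(params, g)]    # extrapolated point
    g_half = grad_fn(half)                              # gradient there
    return [p - lr * gi for p, gi in zip(params, g_half)]

# toy quadratic: minimize ||p||^2
grad_fn = lambda ps: [2 * p for p in ps]
p = [torch.tensor([1.0, -2.0])]
for _ in range(50):
    p = extragradient_step(p, grad_fn)
```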
arXiv Detail & Related papers (2020-06-10T08:22:41Z)