Towards Minimal Fine-Tuning of VLMs
- URL: http://arxiv.org/abs/2512.19219v1
- Date: Mon, 22 Dec 2025 10:02:10 GMT
- Title: Towards Minimal Fine-Tuning of VLMs
- Authors: Tiange Luo, Lajanugen Logeswaran, Jaekyeom Kim, Justin Johnson, Honglak Lee
- Abstract summary: Image-LoRA is a lightweight parameter-efficient fine-tuning recipe for transformer-based vision-language models. Image-LoRA applies low-rank adaptation only to the value path of attention layers within the visual-token span. It matches or closely approaches standard LoRA accuracy while using fewer trainable parameters and lower adapter-only training FLOPs.
- Score: 59.01498204407219
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Image-LoRA, a lightweight parameter-efficient fine-tuning (PEFT) recipe for transformer-based vision-language models (VLMs). Image-LoRA applies low-rank adaptation only to the value path of attention layers within the visual-token span, reducing adapter-only training FLOPs roughly in proportion to the visual-token fraction. We further adapt only a subset of attention heads, selected using head influence scores estimated with a rank-1 Image-LoRA, and stabilize per-layer updates via selection-size normalization. Across screen-centric grounding and referring benchmarks spanning text-heavy to image-heavy regimes, Image-LoRA matches or closely approaches standard LoRA accuracy while using fewer trainable parameters and lower adapter-only training FLOPs. The method also preserves the pure-text reasoning performance of VLMs before and after fine-tuning, as further shown on GSM8K.
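The core idea of the abstract — a low-rank update applied to the value projection only, and only over the visual-token rows — can be sketched in a few lines of numpy. This is an illustrative toy, not the authors' implementation: the shapes, the zero-initialization of the up-projection, and the assumption that visual tokens form a contiguous prefix of the sequence are all choices made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2               # hidden size and LoRA rank (illustrative)
seq_len, n_visual = 6, 4  # assume the first n_visual tokens are visual tokens

W_v = rng.standard_normal((d, d))       # frozen value projection
A = rng.standard_normal((r, d)) * 0.01  # trainable LoRA down-projection
B = np.zeros((d, r))                    # trainable LoRA up-projection (zero init)

def value_path(x):
    """Frozen value projection, plus a low-rank update that touches
    only the visual-token span (rows [0, n_visual))."""
    v = x @ W_v.T
    v[:n_visual] += (x[:n_visual] @ A.T) @ B.T  # adapter skips text tokens
    return v

x = rng.standard_normal((seq_len, d))
v = value_path(x)
# With B zero-initialized the adapter is a no-op before training,
# so the output matches the frozen projection everywhere.
assert np.allclose(v, x @ W_v.T)
```

Because the adapter matmul runs on `n_visual` rows instead of `seq_len`, its FLOPs scale with the visual-token fraction, which is the cost saving the abstract describes.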
Related papers
- MSLoRA: Multi-Scale Low-Rank Adaptation via Attention Reweighting [6.335488846185043]
MSLoRA is a backbone-agnostic, parameter-efficient adapter that reweights feature responses rather than re-tuning the underlying backbone. MSLoRA unifies adaptation for both convolutional neural networks (CNNs) and vision transformers (ViTs).
arXiv Detail & Related papers (2025-11-16T00:35:37Z) - One Last Attention for Your Vision-Language Model [42.872184600248914]
We propose Rational Adaptation (RAda) to explicitly exploit the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix. Experiments show that RAda serves as a versatile fine-tuning technique, improving the baseline with minimal code and performing comparably against current methods in most settings.
arXiv Detail & Related papers (2025-07-21T10:35:32Z) - ConsNoTrainLoRA: Data-driven Weight Initialization of Low-rank Adapters using Constraints [64.35580479051208]
In previous works, low-rank adapters (LoRA) are randomly initialized with a fixed rank across all attachment points. In this paper, we improve convergence and final performance of LoRA fine-tuning using our proposed data-driven weight initialization method.
arXiv Detail & Related papers (2025-07-09T23:52:31Z) - Zero-Shot Adaptation of Parameter-Efficient Fine-Tuning in Diffusion Models [48.22550575107633]
We introduce ProLoRA, enabling zero-shot adaptation of parameter-efficient fine-tuning in text-to-image diffusion models. ProLoRA transfers pre-trained low-rank adjustments from a source to a target model without additional training data.
arXiv Detail & Related papers (2025-05-29T20:37:04Z) - MSPLoRA: A Multi-Scale Pyramid Low-Rank Adaptation for Efficient Model Fine-Tuning [5.412348391086257]
We propose MSPLoRA, which introduces Global Shared LoRA, Mid-Level Shared LoRA, and Layer-Specific LoRA to capture global patterns, mid-level features, and fine-grained information. Experiments on various NLP tasks demonstrate that MSPLoRA achieves more efficient adaptation and better performance while significantly reducing the number of trainable parameters.
arXiv Detail & Related papers (2025-03-27T07:01:50Z) - Complementary Subspace Low-Rank Adaptation of Vision-Language Models for Few-Shot Classification [6.801416831975985]
Vision-language models (VLMs) are designed for large-scale image-text alignment as pretrained foundation models. The low-rank adaptation (LoRA) algorithm has rarely been considered for few-shot fine-tuning of VLMs. We propose the complementary subspace low-rank adaptation (Comp-LoRA) method to regularize the catastrophic forgetting problem in few-shot VLM fine-tuning.
arXiv Detail & Related papers (2025-01-25T02:55:34Z) - Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models by Recycling Pre-Tuned LoRAs [76.40876036912537]
Large Language Models (LLMs) demonstrate strong few-shot adaptability without requiring fine-tuning. Current Visual Foundation Models (VFMs) require explicit fine-tuning with sufficient tuning data. We propose a framework, LoRA Recycle, that distills a meta-LoRA from diverse pre-tuned LoRAs with a meta-learning objective.
arXiv Detail & Related papers (2024-12-03T07:25:30Z) - Replay-Free Continual Low-Rank Adaptation with Dynamic Memory [62.85596937435928]
We revisit continual learning, which enables pre-trained vision transformers (ViTs) to sequentially fine-tune on new downstream tasks over time. Recent studies highlight a crossover between CL techniques and parameter-efficient fine-tuning. We propose a novel PEFT-CL method called Dual Low-Rank Adaptation (DualLoRA).
arXiv Detail & Related papers (2024-11-01T14:28:39Z) - Run LoRA Run: Faster and Lighter LoRA Implementations [50.347242693025336]
LoRA is a technique that reduces the number of trainable parameters in a neural network by introducing low-rank adapters to linear layers.
This paper presents the RunLoRA framework for efficient implementations of LoRA.
Experiments show up to 28% speedup on language modeling networks.
arXiv Detail & Related papers (2023-12-06T10:54:34Z)
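The adapter mechanism that the entries above share can be made concrete with a minimal numpy sketch of a LoRA linear layer: a frozen weight plus a trainable low-rank residual. This is the generic technique as described in the RunLoRA summary, not that paper's optimized implementation; the dimensions, rank, and scaling factor `alpha` are illustrative choices.

```python
import numpy as np

d_in, d_out, r = 64, 64, 4  # layer size and adapter rank (illustrative)
rng = np.random.default_rng(1)

W = rng.standard_normal((d_out, d_in))  # frozen pretrained weight
A = rng.standard_normal((r, d_in))      # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero init

def lora_linear(x, alpha=8.0):
    """Frozen linear layer plus scaled low-rank update: x W^T + (x A^T) B^T * alpha/r."""
    return x @ W.T + (x @ A.T) @ B.T * (alpha / r)

# Only A and B are trained: r*(d_in + d_out) parameters
# instead of the d_out*d_in of a full fine-tune.
full_params = d_out * d_in          # 4096
lora_params = r * (d_in + d_out)    # 512
assert lora_params < full_params
```

Computing `(x @ A.T) @ B.T` in that order keeps every intermediate rank-`r`, which is where both the parameter and compute savings of LoRA-style adapters come from.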
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.