Towards Minimal Fine-Tuning of VLMs
- URL: http://arxiv.org/abs/2512.19219v1
- Date: Mon, 22 Dec 2025 10:02:10 GMT
- Title: Towards Minimal Fine-Tuning of VLMs
- Authors: Tiange Luo, Lajanugen Logeswaran, Jaekyeom Kim, Justin Johnson, Honglak Lee
- Abstract summary: Image-LoRA is a lightweight parameter-efficient fine-tuning recipe for transformer-based vision-language models. Image-LoRA applies low-rank adaptation only to the value path of attention layers within the visual-token span. It matches or closely approaches standard LoRA accuracy while using fewer trainable parameters and lower adapter-only training FLOPs.
- Score: 59.01498204407219
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Image-LoRA, a lightweight parameter-efficient fine-tuning (PEFT) recipe for transformer-based vision-language models (VLMs). Image-LoRA applies low-rank adaptation only to the value path of attention layers within the visual-token span, reducing adapter-only training FLOPs roughly in proportion to the visual-token fraction. We further adapt only a subset of attention heads, selected using head influence scores estimated with a rank-1 Image-LoRA, and stabilize per-layer updates via selection-size normalization. Across screen-centric grounding and referring benchmarks spanning text-heavy to image-heavy regimes, Image-LoRA matches or closely approaches standard LoRA accuracy while using fewer trainable parameters and lower adapter-only training FLOPs. The method also preserves the pure-text reasoning performance of VLMs before and after fine-tuning, as further shown on GSM8K.
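The core idea of the abstract — a low-rank update applied to the value projection only, and only over the visual-token rows — can be sketched in a few lines of numpy. This is an illustrative toy, not the authors' implementation: the shapes, the zero-initialization of the up-projection, and the assumption that visual tokens form a contiguous prefix of the sequence are all choices made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2               # hidden size and LoRA rank (illustrative)
seq_len, n_visual = 6, 4  # assume the first n_visual tokens are visual tokens

W_v = rng.standard_normal((d, d))       # frozen value projection
A = rng.standard_normal((r, d)) * 0.01  # trainable LoRA down-projection
B = np.zeros((d, r))                    # trainable LoRA up-projection (zero init)

def value_path(x):
    """Frozen value projection, plus a low-rank update that touches
    only the visual-token span (rows [0, n_visual))."""
    v = x @ W_v.T
    v[:n_visual] += (x[:n_visual] @ A.T) @ B.T  # adapter skips text tokens
    return v

x = rng.standard_normal((seq_len, d))
v = value_path(x)
# With B zero-initialized the adapter is a no-op before training,
# so the output matches the frozen projection everywhere.
assert np.allclose(v, x @ W_v.T)
```

Because the adapter matmul runs on `n_visual` rows instead of `seq_len`, its FLOPs scale with the visual-token fraction, which is the cost saving the abstract describes.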
Related papers
- MSLoRA: Multi-Scale Low-Rank Adaptation via Attention Reweighting [6.335488846185043]
MSLoRA is a backbone-agnostic, parameter-efficient adapter that reweights feature responses rather than re-tuning the underlying backbone. MSLoRA unifies adaptation for both convolutional neural networks (CNNs) and vision transformers (ViTs).
arXiv Detail & Related papers (2025-11-16T00:35:37Z) - One Last Attention for Your Vision-Language Model [42.872184600248914]
We propose Rational Adaptation (RAda) to explicitly exploit the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix. Experiments show that RAda serves as a versatile fine-tuning technique, improving the baseline with minimal code and performing comparably against current methods in most settings.
arXiv Detail & Related papers (2025-07-21T10:35:32Z) - ConsNoTrainLoRA: Data-driven Weight Initialization of Low-rank Adapters using Constraints [64.35580479051208]
In previous works, low-rank adapters (LoRA) are randomly initialized with a fixed rank across all attachment points. In this paper, we improve convergence and final performance of LoRA fine-tuning using our proposed data-driven weight initialization method.
arXiv Detail & Related papers (2025-07-09T23:52:31Z) - Zero-Shot Adaptation of Parameter-Efficient Fine-Tuning in Diffusion Models [48.22550575107633]
We introduce ProLoRA, enabling zero-shot adaptation of parameter-efficient fine-tuning in text-to-image diffusion models. ProLoRA transfers pre-trained low-rank adjustments from a source to a target model without additional training data.
arXiv Detail & Related papers (2025-05-29T20:37:04Z) - MSPLoRA: A Multi-Scale Pyramid Low-Rank Adaptation for Efficient Model Fine-Tuning [5.412348391086257]
We propose MSPLoRA, which introduces Global Shared LoRA, Mid-Level Shared LoRA, and Layer-Specific LoRA to capture global patterns, mid-level features, and fine-grained information. Experiments on various NLP tasks demonstrate that MSPLoRA achieves more efficient adaptation and better performance while significantly reducing the number of trainable parameters.
arXiv Detail & Related papers (2025-03-27T07:01:50Z) - Complementary Subspace Low-Rank Adaptation of Vision-Language Models for Few-Shot Classification [6.801416831975985]
Vision-language models (VLMs) are designed for large-scale image-text alignment as pretrained foundation models. The low-rank adaptation (LoRA) algorithm has rarely been considered for few-shot fine-tuning of VLMs. We propose the complementary subspace low-rank adaptation (Comp-LoRA) method to regularize the catastrophic forgetting problem in few-shot VLM fine-tuning.
arXiv Detail & Related papers (2025-01-25T02:55:34Z) - Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models by Recycling Pre-Tuned LoRAs [76.40876036912537]
Large Language Models (LLMs) demonstrate strong few-shot adaptability without requiring fine-tuning. Current Visual Foundation Models (VFMs) require explicit fine-tuning with sufficient tuning data. We propose a framework, LoRA Recycle, that distills a meta-LoRA from diverse pre-tuned LoRAs with a meta-learning objective.
arXiv Detail & Related papers (2024-12-03T07:25:30Z) - Replay-Free Continual Low-Rank Adaptation with Dynamic Memory [62.85596937435928]
We revisit continual learning, which enables pre-trained vision transformers (ViTs) to sequentially fine-tune on new downstream tasks over time. Recent studies highlight a crossover between CL techniques and parameter-efficient fine-tuning. We propose a novel PEFT-CL method called Dual Low-Rank Adaptation (DualLoRA).
arXiv Detail & Related papers (2024-11-01T14:28:39Z) - Run LoRA Run: Faster and Lighter LoRA Implementations [50.347242693025336]
LoRA is a technique that reduces the number of trainable parameters in a neural network by introducing low-rank adapters to linear layers.
This paper presents the RunLoRA framework for efficient implementations of LoRA.
Experiments show up to 28% speedup on language modeling networks.
arXiv Detail & Related papers (2023-12-06T10:54:34Z)
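The adapter mechanism that the entries above share can be made concrete with a minimal numpy sketch of a LoRA linear layer: a frozen weight plus a trainable low-rank residual. This is the generic technique as described in the RunLoRA summary, not that paper's optimized implementation; the dimensions, rank, and scaling factor `alpha` are illustrative choices.

```python
import numpy as np

d_in, d_out, r = 64, 64, 4  # layer size and adapter rank (illustrative)
rng = np.random.default_rng(1)

W = rng.standard_normal((d_out, d_in))  # frozen pretrained weight
A = rng.standard_normal((r, d_in))      # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero init

def lora_linear(x, alpha=8.0):
    """Frozen linear layer plus scaled low-rank update: x W^T + (x A^T) B^T * alpha/r."""
    return x @ W.T + (x @ A.T) @ B.T * (alpha / r)

# Only A and B are trained: r*(d_in + d_out) parameters
# instead of the d_out*d_in of a full fine-tune.
full_params = d_out * d_in          # 4096
lora_params = r * (d_in + d_out)    # 512
assert lora_params < full_params
```

Computing `(x @ A.T) @ B.T` in that order keeps every intermediate rank-`r`, which is where both the parameter and compute savings of LoRA-style adapters come from.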
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.