Rethinking Fine-Tuning: Unlocking Hidden Capabilities in Vision-Language Models
- URL: http://arxiv.org/abs/2512.23073v1
- Date: Sun, 28 Dec 2025 20:41:22 GMT
- Title: Rethinking Fine-Tuning: Unlocking Hidden Capabilities in Vision-Language Models
- Authors: Mingyuan Zhang, Yue Bai, Yifan Wang, Yiyang Huang, Yun Fu
- Abstract summary: Mask Fine-Tuning (MFT) can be a powerful and efficient post-training paradigm for language models. We show that MFT consistently surpasses LoRA variants and even full fine-tuning. Our findings reveal that effective adaptation can emerge not only from updating weights but also from reestablishing connections among the model's existing knowledge.
- Score: 44.50699778141182
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Explorations in fine-tuning Vision-Language Models (VLMs), such as Low-Rank Adaptation (LoRA) from Parameter Efficient Fine-Tuning (PEFT), have made impressive progress. However, most approaches rely on explicit weight updates, overlooking the extensive representational structures already encoded in pre-trained models that remain underutilized. Recent works have demonstrated that Mask Fine-Tuning (MFT) can be a powerful and efficient post-training paradigm for language models. Instead of updating weights, MFT assigns learnable gating scores to each weight, allowing the model to reorganize its internal subnetworks for downstream task adaptation. In this paper, we rethink fine-tuning for VLMs from a structural reparameterization perspective grounded in MFT. We apply MFT to the language and projector components of VLMs with different language backbones and compare against strong PEFT baselines. Experiments show that MFT consistently surpasses LoRA variants and even full fine-tuning, achieving high performance without altering the frozen backbone. Our findings reveal that effective adaptation can emerge not only from updating weights but also from reestablishing connections among the model's existing knowledge. Code available at: https://github.com/Ming-K9/MFT-VLM
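The gating mechanism described in the abstract can be pictured with a short PyTorch sketch. This is only an illustration of the idea, not the authors' released implementation (see the linked repository for that): it assumes one learnable score per weight of a frozen linear layer, binarization at 0.5 with a straight-through estimator, and that only projector and language-model layers are wrapped; the names MaskedLinear and apply_mft are illustrative.
```python
# Illustrative sketch of mask fine-tuning on frozen linear layers (assumptions:
# per-weight sigmoid scores, straight-through binarization at 0.5).
import torch
import torch.nn as nn


class MaskedLinear(nn.Module):
    """Wraps a frozen nn.Linear and learns a gating score for every weight."""

    def __init__(self, linear: nn.Linear, init_score: float = 2.0):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False  # the pre-trained weights stay frozen
        # One score per weight; sigmoid(2.0) ≈ 0.88, so masks start mostly "on".
        self.scores = nn.Parameter(torch.full_like(linear.weight, init_score))

    def forward(self, x):
        probs = torch.sigmoid(self.scores)
        # Hard 0/1 mask in the forward pass, sigmoid gradient in the backward
        # pass (straight-through estimator).
        mask = (probs > 0.5).float() + probs - probs.detach()
        return nn.functional.linear(x, self.linear.weight * mask, self.linear.bias)


def apply_mft(module: nn.Module, keywords=("proj", "mlp", "attn")) -> nn.Module:
    """Recursively wrap matching nn.Linear layers (e.g. projector / LM blocks)."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and any(k in name.lower() for k in keywords):
            setattr(module, name, MaskedLinear(child))
        else:
            apply_mft(child, keywords)
    return module
```
Under this reading, only the scores tensors are passed to the optimizer; the backbone (and, in a VLM, the vision encoder) is left untouched, which matches the abstract's claim that adaptation comes from reorganizing existing subnetworks rather than from weight updates.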
Related papers
- ABBA-Adapters: Efficient and Expressive Fine-Tuning of Foundation Models [10.17362679822278]
Large Language Models have demonstrated strong performance across a wide range of tasks, but adapting them efficiently to new domains remains a key challenge. We introduce ABBA, a new PEFT architecture that reparameterizes the update as a Hadamard product of two independently learnable low-rank matrices. In contrast to prior work, ABBA fully decouples the update from the pre-trained weights, enabling both components to be optimized freely. (A sketch of this update form appears after the list below.)
arXiv Detail & Related papers (2025-05-20T11:43:25Z)
- Shadow-FT: Tuning Instruct Model via Training on Paired Base Model [67.20706292627106]
Large language models (LLMs) consistently benefit from further fine-tuning on various tasks. We propose a novel Shadow-FT framework to tune the Instruct models by leveraging the corresponding Base models. Our proposed Shadow-FT introduces no additional parameters, is easy to implement, and significantly improves performance.
arXiv Detail & Related papers (2025-05-19T05:16:21Z)
- Boosting Large Language Models with Mask Fine-Tuning [60.56962908455601]
We introduce Mask Fine-Tuning (MFT) to show that properly breaking the integrity of the model can surprisingly lead to improved performance. Experiments show that MFT gains a consistent performance boost across various domains and backbones.
arXiv Detail & Related papers (2025-03-27T20:17:57Z)
- Large Language Diffusion Models [93.26422905620008]
Large language models (LLMs) are widely regarded as relying on autoregressive models (ARMs). We introduce LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning paradigm. Across extensive benchmarks on general tasks, math, code, and so on, LLaDA demonstrates strong scalability and performs comparably to our self-constructed ARM baselines.
arXiv Detail & Related papers (2025-02-14T08:23:51Z)
- ReFT: Representation Finetuning for Language Models [74.51093640257892]
We develop a family of Representation Finetuning (ReFT) methods.
ReFTs operate on a frozen base model and learn task-specific interventions on hidden representations.
We showcase LoReFT on eight commonsense reasoning tasks, four arithmetic reasoning tasks, instruction-tuning, and GLUE.
arXiv Detail & Related papers (2024-04-04T17:00:37Z)
- MatFormer: Nested Transformer for Elastic Inference [91.45687988953435]
MatFormer is a novel Transformer architecture designed to provide elastic inference across diverse deployment constraints. MatFormer achieves this by incorporating a nested Feed Forward Network (FFN) block structure within a standard Transformer model. We show that an 850M decoder-only MatFormer language model (MatLM) allows us to extract multiple smaller models spanning from 582M to 850M parameters.
arXiv Detail & Related papers (2023-10-11T17:57:14Z)
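For the ABBA entry above, the phrase "a Hadamard product of two independently learnable low-rank matrices" suggests an update of roughly the following form. This is a sketch reconstructed from the abstract alone; the rank symbols r_1, r_2 and the dimension names are illustrative, not taken from the paper.
```latex
\Delta W = (B_1 A_1) \odot (B_2 A_2), \qquad W_{\text{adapted}} = W_{\text{pre}} + \Delta W,
\quad B_i \in \mathbb{R}^{d_{\text{out}} \times r_i}, \; A_i \in \mathbb{R}^{r_i \times d_{\text{in}}}
```
Here ⊙ is the element-wise (Hadamard) product and both factor pairs are trained, so the update is expressed entirely through the new low-rank factors rather than being tied to the pre-trained weight W_pre.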