Explore How to Inject Beneficial Noise in MLLMs
- URL: http://arxiv.org/abs/2511.12917v1
- Date: Mon, 17 Nov 2025 03:11:41 GMT
- Title: Explore How to Inject Beneficial Noise in MLLMs
- Authors: Ruishu Zhu, Sida Huang, Ziheng Jiao, Hongyuan Zhang,
- Abstract summary: Multimodal Large Language Models (MLLMs) have played an increasingly important role in multimodal intelligence. We propose a novel fine-tuning strategy by injecting beneficial random noise, which outperforms previous methods and even surpasses full fine-tuning.
- Score: 10.778199931281485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have played an increasingly important role in multimodal intelligence. However, the existing fine-tuning methods often ignore cross-modal heterogeneity, limiting their full potential. In this work, we propose a novel fine-tuning strategy by injecting beneficial random noise, which outperforms previous methods and even surpasses full fine-tuning, with minimal additional parameters. The proposed Multimodal Noise Generator (MuNG) enables efficient modality fine-tuning by injecting customized noise into the frozen MLLMs. Specifically, we reformulate the reasoning process of MLLMs from a variational inference perspective, upon which we design a multimodal noise generator that dynamically analyzes cross-modal relationships in image-text pairs to generate task-adaptive beneficial noise. Injecting this type of noise into the MLLMs effectively suppresses irrelevant semantic components, leading to significantly improved cross-modal representation alignment and enhanced performance on downstream tasks. Experiments on two mainstream MLLMs, QwenVL and LLaVA, demonstrate that our method surpasses full-parameter fine-tuning and other existing fine-tuning approaches, while requiring adjustments to only about $1\sim2\%$ additional parameters. The relevant code is uploaded in the supplementary.
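The abstract describes a generator that reads an image-text pair and produces noise that is injected into a frozen model. The sketch below is only a toy illustration of that idea, not the authors' MuNG implementation: the class `ToyNoiseGenerator`, the Gaussian reparameterization, the mean pooling, and the choice to add the noise to visual token features are all assumptions made for this example, since the abstract does not specify them.

```python
import math
import random


def mean_pool(tokens):
    """Average a list of equal-length token vectors into one vector."""
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


def linear(weights, bias, x):
    """Compute y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


class ToyNoiseGenerator:
    """Hypothetical noise generator: the only part that would be trained."""

    def __init__(self, dim, rng):
        self.rng = rng
        in_dim = 2 * dim  # pooled image features concatenated with pooled text features
        self.w_mu = [[rng.gauss(0, 0.01) for _ in range(in_dim)] for _ in range(dim)]
        self.b_mu = [0.0] * dim
        self.w_logvar = [[rng.gauss(0, 0.01) for _ in range(in_dim)] for _ in range(dim)]
        self.b_logvar = [-4.0] * dim  # assumed init: start with a small variance

    def __call__(self, image_tokens, text_tokens):
        # Condition the noise on both modalities (assumed: simple mean pooling).
        context = mean_pool(image_tokens) + mean_pool(text_tokens)
        mu = linear(self.w_mu, self.b_mu, context)
        logvar = linear(self.w_logvar, self.b_logvar, context)
        # Reparameterization: noise = mu + sigma * z, with z ~ N(0, 1).
        return [m + math.exp(0.5 * lv) * self.rng.gauss(0, 1)
                for m, lv in zip(mu, logvar)]


def inject(tokens, noise):
    """Add the same noise vector to every token; the inputs stay unchanged."""
    return [[v + n for v, n in zip(tok, noise)] for tok in tokens]


rng = random.Random(0)
image_tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
text_tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2)]

generator = ToyNoiseGenerator(dim=4, rng=rng)
noise = generator(image_tokens, text_tokens)
noisy_tokens = inject(image_tokens, noise)
```

In this toy setup the backbone features are never modified in place; only the generator's weights would receive gradients, which mirrors the abstract's claim that the frozen MLLM is adapted with a small number of additional parameters.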
Related papers
- A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models [85.30893355216486]
We study how visual token redundancy evolves with different dMLLM architectures and tasks. Our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. Layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs.
arXiv Detail & Related papers (2025-11-19T04:13:36Z) - MokA: Multimodal Low-Rank Adaptation for MLLMs [11.440424554587674]
Multimodal Low-Rank Adaptation (MokA) is a multimodal-aware efficient fine-tuning strategy. MokA compresses unimodal information by modality-specific parameters while explicitly enhancing cross-modal interaction.
arXiv Detail & Related papers (2025-06-05T16:04:08Z) - Enhance Vision-Language Alignment with Noise [59.2608298578913]
We investigate whether the frozen model can be fine-tuned by customized noise. We propose Positive-incentive Noise (PiNI), which can fine-tune CLIP by injecting noise into both visual and text encoders.
arXiv Detail & Related papers (2024-12-14T12:58:15Z) - R-MTLLMF: Resilient Multi-Task Large Language Model Fusion at the Wireless Edge [78.26352952957909]
Multi-task large language models (MTLLMs) are important for many applications at the wireless edge, where users demand specialized models to handle multiple tasks efficiently. The concept of model fusion via task vectors has emerged as an efficient approach for combining fine-tuning parameters to produce an MTLLM. In this paper, the problem of enabling edge users to collaboratively craft such MTLLMs via task vectors is studied, under the assumption of worst-case adversarial attacks.
arXiv Detail & Related papers (2024-11-27T10:57:06Z) - M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning [90.75075886543404]
Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains.
In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs.
arXiv Detail & Related papers (2024-09-24T01:40:24Z) - MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training [9.023648972811458]
RagVL is a novel framework with knowledge-enhanced reranking and noise-injected training.
We instruction-tune the MLLM with a simple yet effective instruction template to induce its ranking ability.
For generation, we inject visual noise during training at the data and token levels to enhance the generator's robustness.
arXiv Detail & Related papers (2024-07-31T08:43:17Z) - MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering [66.05768870785548]
Finetuning pretrained Vision-Language Models (VLMs) has been a prevailing paradigm for achieving state-of-the-art performance in Visual Question Answering (VQA).
Current parameter-efficient tuning methods dramatically reduce the number of tunable parameters, but there still exists a significant performance gap with full finetuning.
We propose MixPHM, a redundancy-aware parameter-efficient tuning method that outperforms full finetuning in low-resource VQA.
arXiv Detail & Related papers (2023-03-02T13:28:50Z) - NoisyTune: A Little Noise Can Help You Finetune Pretrained Language Models Better [98.5705258907774]
Finetuning pretrained language models (PLMs) is critical for their success in downstream tasks.
PLMs risk overfitting to pretraining signals, and there are gaps between downstream tasks and the pretraining tasks.
NoisyTune can help better finetune PLMs in downstream tasks by adding some noise to the parameters of PLMs before finetuning.
arXiv Detail & Related papers (2022-02-24T11:08:02Z) - Robust Multi-Objective Bayesian Optimization Under Input Noise [27.603887040015888]
In many manufacturing processes, the design parameters are subject to random input noise, resulting in a product that is often less performant than expected.
In this work, we propose the first multi-objective BO method that is robust to input noise.
arXiv Detail & Related papers (2022-02-15T16:33:48Z) - Multiview point cloud registration with anisotropic and space-varying localization noise [1.5499426028105903]
We address the problem of registering multiple point clouds corrupted with high anisotropic localization noise.
Existing methods are based on an implicit assumption of space-invariant isotropic noise.
We show that our noise-handling strategy significantly improves robustness to high levels of anisotropic noise.
arXiv Detail & Related papers (2022-01-03T15:21:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.