Evaluating and Steering Modality Preferences in Multimodal Large Language Model
- URL: http://arxiv.org/abs/2505.20977v1
- Date: Tue, 27 May 2025 10:07:59 GMT
- Title: Evaluating and Steering Modality Preferences in Multimodal Large Language Model
- Authors: Yu Zhang, Jinlong Ma, Yongshuai Hou, Xuefeng Bai, Kehai Chen, Yang Xiang, Jun Yu, Min Zhang
- Abstract summary: Multimodal large language models (MLLMs) have achieved remarkable performance on complex tasks with multimodal context. We show that all 18 tested MLLMs generally demonstrate clear modality bias, and modality preference can be influenced by external interventions. We propose a probing and steering method based on representation engineering to explicitly control modality preference.
- Score: 32.94581875014947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (MLLMs) have achieved remarkable performance on complex tasks with multimodal context. However, it is still understudied whether they exhibit modality preference when processing multimodal contexts. To study this question, we first build a MC² benchmark under controlled evidence conflict scenarios to systematically evaluate modality preference, which is the tendency to favor one modality over another when making decisions based on multimodal conflicting evidence. Our extensive evaluation reveals that all 18 tested MLLMs generally demonstrate clear modality bias, and modality preference can be influenced by external interventions. An in-depth analysis reveals that the preference direction can be captured within the latent representations of MLLMs. Built on this, we propose a probing and steering method based on representation engineering to explicitly control modality preference without additional fine-tuning or carefully crafted prompts. Our method effectively amplifies modality preference toward a desired direction and applies to downstream tasks such as hallucination mitigation and multimodal machine translation, yielding promising improvements.
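The abstract describes capturing a preference direction in latent representations and steering along it. A minimal sketch of that general idea, using a difference-of-means steering vector on toy activations. All names, dimensions, and data below are illustrative assumptions; the paper's actual probing and steering procedure may differ.

```python
# Sketch of representation-engineering-style steering (hypothetical setup):
# estimate a "prefer text over image" direction from two pools of hidden
# states, then shift a hidden state along it at inference time.
import numpy as np

def steering_vector(text_pref_acts: np.ndarray, image_pref_acts: np.ndarray) -> np.ndarray:
    """Unit direction from image-preferring toward text-preferring activations."""
    v = text_pref_acts.mean(axis=0) - image_pref_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden state along the preference direction by strength alpha."""
    return hidden + alpha * v

rng = np.random.default_rng(0)
d = 16
text_acts = rng.normal(loc=1.0, size=(64, d))    # toy activations on text-favoring inputs
image_acts = rng.normal(loc=-1.0, size=(64, d))  # toy activations on image-favoring inputs
v = steering_vector(text_acts, image_acts)

h = rng.normal(size=d)
h_steered = steer(h, v, alpha=4.0)
# The steered state's projection onto the preference direction increases
assert h_steered @ v > h @ v
```

In a real MLLM the pools of activations would be collected from a chosen layer via forward hooks, and the sign of alpha would select which modality to amplify.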
Related papers
- MokA: Multimodal Low-Rank Adaptation for MLLMs [11.440424554587674]
Multimodal low-rank Adaptation (MokA) is a multimodal-aware efficient fine-tuning strategy. MokA compresses unimodal information by modality-specific parameters while explicitly enhancing cross-modal interaction.
arXiv Detail & Related papers (2025-06-05T16:04:08Z) - Rethinking Information Synthesis in Multimodal Question Answering: A Multi-Agent Perspective [42.832839189236694]
We propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.
arXiv Detail & Related papers (2025-05-27T07:23:38Z) - TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models [23.916205754112774]
Multimodal Large Language Models (MLLMs) have shown remarkable versatility in understanding diverse multimodal data and tasks. We propose TAMP, a simple yet effective pruning framework tailored for MLLMs. We validate our method on two state-of-the-art MLLMs: LLaVA-NeXT, designed for vision-language tasks, and VideoLLaMA2, capable of processing audio, visual, and language modalities.
arXiv Detail & Related papers (2025-04-14T05:44:38Z) - Distilling Transitional Pattern to Large Language Models for Multimodal Session-based Recommendation [67.84581846180458]
Session-based recommendation (SBR) predicts the next item based on anonymous sessions. Recent Multimodal SBR methods utilize simplistic pre-trained models for modality learning but have limitations in semantic richness. We propose a multimodal LLM-enhanced framework TPAD, which extends a distillation paradigm to decouple and align transitional patterns for promoting MSBR.
arXiv Detail & Related papers (2025-04-13T07:49:08Z) - Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs [29.07102440466282]
We propose a scalable framework designed to automatically construct alignment preferences grounded in instruction fulfillment efficacy. Our method involves an automated preference construction coupled with a dedicated verification process. Experiments conducted on Qwen2VL-7B demonstrate IPA's effectiveness across multiple benchmarks.
arXiv Detail & Related papers (2025-03-26T08:19:02Z) - CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs [107.21334626890713]
Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities. We propose a Cross-modal Hierarchical Direct Preference Optimization (CHiP) to address these limitations. We evaluate CHiP through both quantitative and qualitative analyses, with results across multiple benchmarks demonstrating its effectiveness in reducing hallucinations.
arXiv Detail & Related papers (2025-01-28T02:05:38Z) - Progressive Multimodal Reasoning via Active Retrieval [64.74746997923967]
Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs). We propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs. We show that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
arXiv Detail & Related papers (2024-12-19T13:25:39Z) - Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [65.64108848398696]
We introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset. We explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance.
arXiv Detail & Related papers (2024-11-15T18:59:27Z) - Harnessing Multimodal Large Language Models for Multimodal Sequential Recommendation [21.281471662696372]
We propose the Multimodal Large Language Model-enhanced Multimodal Sequential Recommendation (MLLM-MSR) model. To capture dynamic user preferences, we design a two-stage user preference summarization method. We then employ a recurrent user preference summarization generation paradigm to capture the dynamic changes in user preferences.
arXiv Detail & Related papers (2024-08-19T04:44:32Z) - Look Before You Decide: Prompting Active Deduction of MLLMs for Assumptive Reasoning [77.72128397088409]
We show that most prevalent MLLMs can be easily fooled by the introduction of a presupposition into the question. We also propose a novel reinforcement learning paradigm to encourage the model to actively perform composite deduction.
arXiv Detail & Related papers (2024-04-19T15:53:27Z) - Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models [56.256069117502385]
Chain of Thought (CoT) approaches can be used to enhance the capability of Large Language Models (LLMs) on complex reasoning tasks.
However, the selection of optimal CoT demonstration examples in multi-modal reasoning remains less explored.
We introduce a novel approach that addresses this challenge by using retrieval mechanisms to automatically select demonstration examples.
arXiv Detail & Related papers (2023-12-04T08:07:21Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Adaptive Contrastive Learning on Multimodal Transformer for Review Helpfulness Predictions [40.70793282367128]
We propose a Multimodal Contrastive Learning approach for the Multimodal Review Helpfulness Prediction (MRHP) problem.
In addition, we introduce an Adaptive Weighting scheme for our contrastive learning approach.
Finally, we propose a Multimodal Interaction module to address the unaligned nature of multimodal data.
arXiv Detail & Related papers (2022-11-07T13:05:56Z) - Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference [3.505062507621494]
We propose a Multimodal Hierarchical Selective Transformer (mhsf) model that considers reciprocal relationships among modalities.
We evaluate the generality of the proposed mhsf model with both pre-trained+fine-tuning and fresh training strategies.
arXiv Detail & Related papers (2021-08-11T09:59:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.