Related papers: Instruction Anchors: Dissecting the Causal Dynamics of Modality Arbitration

URL: http://arxiv.org/abs/2602.03677v1
Date: Tue, 03 Feb 2026 15:59:24 GMT
Title: Instruction Anchors: Dissecting the Causal Dynamics of Modality Arbitration
Authors: Yu Zhang, Mufan Xu, Xuefeng Bai, Kehai Chen, Pengfei Zhang, Yang Xiang, Min Zhang
Abstract summary: Modality following serves as the capacity of multimodal large language models to selectively utilize multimodal contexts based on user instructions. We show that instruction tokens function as structural anchors for modality arbitration. We identify a sparse set of specialized attention heads that drive this arbitration.
Score: 41.64118238383843
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modality following serves as the capacity of multimodal large language models (MLLMs) to selectively utilize multimodal contexts based on user instructions. It is fundamental to ensuring safety and reliability in real-world deployments. However, the underlying mechanisms governing this decision-making process remain poorly understood. In this paper, we investigate its working mechanism through an information flow lens. Our findings reveal that instruction tokens function as structural anchors for modality arbitration: Shallow attention layers perform non-selective information transfer, routing multimodal cues to these anchors as a latent buffer; Modality competition is resolved within deep attention layers guided by the instruction intent, while MLP layers exhibit semantic inertia, acting as an adversarial force. Furthermore, we identify a sparse set of specialized attention heads that drive this arbitration. Causal interventions demonstrate that manipulating a mere $5\%$ of these critical heads can decrease the modality-following ratio by $60\%$ through blocking, or increase it by $60\%$ through targeted amplification of failed samples. Our work provides a substantial step toward model transparency and offers a principled framework for the orchestration of multimodal information in MLLMs.
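The head-level interventions described in the abstract (blocking or amplifying selected attention heads) can be illustrated with a toy example. The following is a minimal sketch and not the authors' implementation: it assumes a single query position, an identity output projection, and random toy tensors, and the names `head_output`, `attention_with_head_scaling`, and `head_scale` are hypothetical. A scale of 0.0 blocks a head, 1.0 leaves it unchanged, and values above 1.0 amplify it.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_output(query, keys, values):
    """Scaled dot-product attention for one head and one query position."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def attention_with_head_scaling(queries, keys, values, head_scale):
    """Sum per-head outputs after multiplying each head by its scale.

    Assumption: each head's output already lives in the model dimension
    (identity output projection), so heads can be summed directly.
    """
    dim = len(values[0][0])
    total = [0.0] * dim
    for q, k, v, s in zip(queries, keys, values, head_scale):
        out = head_output(q, k, v)
        total = [t + s * o for t, o in zip(total, out)]
    return total


# Toy setup (hypothetical sizes): 4 heads, 5 context tokens, dimension 3.
random.seed(0)
N_HEADS, SEQ_LEN, DIM = 4, 5, 3


def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]


queries = [rand_vec() for _ in range(N_HEADS)]
keys = [[rand_vec() for _ in range(SEQ_LEN)] for _ in range(N_HEADS)]
values = [[rand_vec() for _ in range(SEQ_LEN)] for _ in range(N_HEADS)]

baseline = attention_with_head_scaling(queries, keys, values, [1.0] * N_HEADS)
# Block head 0 only.
blocked = attention_with_head_scaling(queries, keys, values, [0.0, 1.0, 1.0, 1.0])
# Amplify head 0 by a factor of 2.
amplified = attention_with_head_scaling(queries, keys, values, [2.0, 1.0, 1.0, 1.0])
```

Because the per-head outputs are summed linearly in this sketch, the difference between `baseline` and `blocked` is exactly head 0's contribution, which is what a blocking-style intervention removes. In a real model the effect would then propagate through later layers, which this toy example does not cover.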

Related papers

DLLM Agent: See Farther, Run Faster [94.74432470237817]
Diffusion large language models (DLLMs) have emerged as an alternative to autoregressive (AR) decoding with appealing efficiency and modeling properties. We study this in a controlled setting by instantiating DLLM and AR backbones within the same agent workflow. We find that DLLM Agents are on average over 30% faster end to end than AR agents, with some cases exceeding 8x speedup.
arXiv Detail & Related papers (2026-02-07T09:01:18Z)
Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition [51.68340973140949]
Grounded Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions. MLLMs exhibit modality bias, including visual bias and textual bias, which stems from their tendency to take unimodal shortcuts. We propose Modality-aware Consistency Reasoning (MCR), which enforces structured cross-modal reasoning.
arXiv Detail & Related papers (2026-02-04T12:12:49Z)
Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation [50.22481337087162]
Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Refer-Agent is a collaborative multi-agent system with alternating reasoning-reflection mechanisms. It significantly outperforms state-of-the-art methods, including both SFT-based models and zero-shot approaches.
arXiv Detail & Related papers (2026-02-03T14:48:12Z)
Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs [5.380090638488105]
MMA-Bench comprises videos and tasks that probe a model's reliance on specific modalities. We show that current MLLMs struggle under misaligned audio-visual pairs and simple misleading text. We propose a modality alignment tuning strategy to teach the model when to prioritize, leverage, or ignore specific modality cues.
arXiv Detail & Related papers (2025-11-28T01:21:29Z)
Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. We propose enhancing interpretability by leveraging intra-modal interaction.
arXiv Detail & Related papers (2025-09-26T14:39:13Z)
DMFI: Dual-Modality Fine-Tuning and Inference Framework for LLM-Based Insider Threat Detection [9.049925971684837]
Insider threat detection (ITD) poses a persistent and high-impact challenge in cybersecurity. Traditional models often struggle to capture semantic intent and complex behavior dynamics. We propose DMFI, a dual-modality framework that integrates semantic inference with behavior-aware fine-tuning.
arXiv Detail & Related papers (2025-08-06T18:44:40Z)
Fine-grained Multiple Supervisory Network for Multi-modal Manipulation Detecting and Grounding [15.670987587762259]
Detecting and Grounding Multi-Modal Media Manipulation (DGM$^4$) is a branch of misinformation detection. Existing methods are often limited in performance due to the erroneous interference caused by unreliable unimodal data. We present a Fine-grained Multiple Supervisory (FMS) network, which incorporates modality reliability supervision, unimodal internal supervision and cross-modal supervision.
arXiv Detail & Related papers (2025-08-04T14:46:59Z)
MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models. It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
Prompt Highlighter: Interactive Control for Multi-Modal LLMs [50.830448437285355]
This study targets a critical aspect of multi-modal LLMs' (LLMs&VLMs) inference: explicit controllable text generation. We introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation. We find that, during inference, guiding the models with highlighted tokens through the attention weights leads to more desired outputs.
arXiv Detail & Related papers (2023-12-07T13:53:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.