Related papers: Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation

Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation

URL: http://arxiv.org/abs/2509.22496v1
Date: Fri, 26 Sep 2025 15:38:42 GMT
Title: Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
Authors: Ruoyu Chen, Xiaoqing Guo, Kangwei Liu, Siyuan Liang, Shiming Liu, Qunli Zhang, Hua Zhang, Xiaochun Cao,
Abstract summary: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs.<n>Yet, the extent to which generated tokens depend on visual modalities remains poorly understood.<n>We present a lightweight black-box framework for explaining autoregressive token generation in MLLMs.
Score: 59.40886078302025
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs. The code is available at https://github.com/RuoyuChen10/EAGLE.

Related papers

Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs [100.02824137397464]
We investigate how Large Language Models adapt their internal representations when encountering inputs of increasing difficulty.<n>We reveal a consistent and quantifiable phenomenon: as task difficulty increases, the last hidden states of LLMs become substantially sparser.<n>This sparsity--difficulty relation is observable across diverse models and domains.
arXiv Detail & Related papers (2026-03-03T18:48:15Z)
LLM-based Embeddings: Attention Values Encode Sentence Semantics Better Than Hidden States [13.418437639290532]
Sentence representations are foundational to many Natural Language Processing (NLP) applications.<n>This paper introduces a novel perspective, demonstrating that attention value vectors capture sentence semantics more effectively than hidden states.
arXiv Detail & Related papers (2026-02-02T03:09:37Z)
Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood.<n>We propose enhancing interpretability by leveraging intra-modal interaction.
arXiv Detail & Related papers (2025-09-26T14:39:13Z)
Token Activation Map to Visually Explain Multimodal LLMs [23.774995444587667]
We propose an estimated causal inference method to mitigate the interference of context to achieve high-quality MLLM explanation.<n>We term this method Token Activation Map (TAM) to highlight the consideration of interactions between tokens.<n>Our TAM method significantly outperforms existing SoTA methods, showcasing high-quality visualization results.
arXiv Detail & Related papers (2025-06-29T14:50:45Z)
Attention Reallocation: Towards Zero-cost and Controllable Hallucination Mitigation of MLLMs [62.9348974370985]
We propose attention reallocation (AttnReal) to mitigate hallucinations with nearly zero extra cost.<n>Our approach is motivated by the key observations that, MLLM's unreasonable attention distribution causes features to be dominated by historical output tokens.<n>Based on the observations, AttnReal recycles excessive attention from output tokens and reallocates it to visual tokens, which reduces MLLM's reliance on language priors.
arXiv Detail & Related papers (2025-03-11T11:52:37Z)
Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [66.04061083611863]
Excessive use of visual tokens in existing Multimoal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation.<n>We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE)<n>DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
arXiv Detail & Related papers (2024-11-29T11:24:23Z)
Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language.<n>This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok)<n>SeTok groups visual features into semantic units via a dynamic clustering algorithm.<n>The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z)
IERL: Interpretable Ensemble Representation Learning -- Combining CrowdSourced Knowledge and Distributed Semantic Representations [11.008412414253662]
Large Language Models (LLMs) encode meanings of words in the form of distributed semantics. Recent studies have shown that LLMs tend to generate unintended, inconsistent, or wrong texts as outputs. We propose a novel ensemble learning method, Interpretable Ensemble Representation Learning (IERL), that systematically combines LLM and crowdsourced knowledge representations.
arXiv Detail & Related papers (2023-06-24T05:02:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.