Fugu-MT paper translation (abstract): Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation

Paper summary: Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation

arxiv url: http://arxiv.org/abs/2509.22496v1
Date: Fri, 26 Sep 2025 15:38:42 GMT
Status: Translation complete
Last updated in system: 2025-09-29 20:57:54.555343
Title: Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
Title (reference translation): What MLLMs Aim For: Explaining Autoregressive Token Generation
Authors: Ruoyu Chen, Xiaoqing Guo, Kangwei Liu, Siyuan Liang, Shiming Liu, Qunli Zhang, Hua Zhang, Xiaochun Cao
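Abstract summary: Multimodal large language models (MLLMs) align visual inputs with natural language outputs. However, the extent to which generated tokens depend on the visual modality remains poorly understood. The paper proposes a lightweight black-box framework for explaining autoregressive token generation in MLLMs.
Reference score (attention score computed independently by this site): 59.40886078302025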
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs. The code is available at https://github.com/RuoyuChen10/EAGLE.
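Abstract (reference translation): Multimodal large language models (MLLMs) align visual inputs with natural language outputs. However, how far generated tokens depend on the visual modality is still poorly understood, which limits interpretability and reliability. This work proposes EAGLE, a lightweight black-box framework that explains autoregressive token generation in MLLMs. EAGLE attributes selected tokens to compact perceptual regions and quantifies the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized by greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs a modality-aware analysis that disentangles what tokens rely on, giving fine-grained interpretability of model decisions. Extensive experiments on open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis while using substantially less GPU memory. These results highlight its effectiveness and practicality for improving the interpretability of MLLMs. The code is available at https://github.com/RuoyuChen10/EAGLE.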
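
The greedy search over image regions described in the abstract can be illustrated with a minimal Python sketch. This is not the EAGLE implementation: the abstract does not give the exact objective, so the sketch assumes the insight and necessity terms are simply added, and `score_fn`, `greedy_attribution`, and `toy_score` are hypothetical names. `score_fn` stands in for a black-box query that returns the model's probability of the target token when only the given regions are visible.

```python
from typing import Callable, FrozenSet, List

# Hypothetical black-box scorer: probability of the target token
# when only the regions in the given set are left visible.
ScoreFn = Callable[[FrozenSet[int]], float]


def greedy_attribution(num_regions: int, score_fn: ScoreFn) -> List[int]:
    """Order regions by greedily maximizing insight + necessity.

    Assumption: the two terms are combined by plain addition; the
    actual objective in the paper may be weighted or defined differently.
    """
    all_regions = frozenset(range(num_regions))
    selected: List[int] = []
    remaining = set(range(num_regions))
    while remaining:
        best_region, best_value = -1, float("-inf")
        for r in sorted(remaining):  # sorted for deterministic tie-breaking
            candidate = frozenset(selected) | {r}
            # Sufficiency: how well the candidate regions alone support the token.
            insight = score_fn(candidate)
            # Indispensability: how much the score drops when they are removed.
            necessity = 1.0 - score_fn(all_regions - candidate)
            value = insight + necessity
            if value > best_value:
                best_region, best_value = r, value
        selected.append(best_region)
        remaining.remove(best_region)
    return selected


# Toy stand-in for an MLLM: each region contributes a fixed share
# of the target token's probability (purely illustrative numbers).
WEIGHTS = [0.05, 0.60, 0.10, 0.25]


def toy_score(visible: FrozenSet[int]) -> float:
    return sum(WEIGHTS[i] for i in visible)


ranking = greedy_attribution(len(WEIGHTS), toy_score)
print(ranking)  # regions ordered from most to least influential
```

With the additive toy scorer, the ranking follows the region weights. A real model call would make each `score_fn` evaluation a forward pass, so the number of regions (the sparsification step) directly bounds the search cost.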

Related papers