Fugu-MT paper translation (abstract): DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models

Paper summary: DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models

arxiv url: http://arxiv.org/abs/2603.06302v1
Date: Fri, 06 Mar 2026 14:07:37 GMT
Status: translation complete
Last updated in system: 2026-03-09 13:17:45.878975
Title: DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models
Title (reference translation): DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models
Authors: Walid Bousselham, Angie Boggust, Hendrik Strobelt, Hilde Kuehne
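Abstract summary: This paper proposes DEX-AR, a novel explainability method. It generates both per-token and sequence-level 2D heatmaps that highlight the image regions crucial to the model's textual responses. Evaluation on ImageNet, VQAv2, and PascalVOC shows a consistent improvement in both perturbation-based and segmentation-based metrics.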
Reference score (independently computed attention score): 27.64151438258739
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As Vision-Language Models (VLMs) become increasingly sophisticated and widely used, it becomes more and more crucial to understand their decision-making process. Traditional explainability methods, designed for classification tasks, struggle with modern autoregressive VLMs due to their complex token-by-token generation process and intricate interactions between visual and textual modalities. We present DEX-AR (Dynamic Explainability for AutoRegressive models), a novel explainability method designed to address these challenges by generating both per-token and sequence-level 2D heatmaps highlighting image regions crucial for the model's textual responses. The proposed method interprets autoregressive VLMs, including the varying importance of layers and generated tokens, by computing layer-wise gradients with respect to attention maps during the token-by-token generation process. DEX-AR introduces two key innovations: a dynamic head filtering mechanism that identifies attention heads focused on visual information, and a sequence-level filtering approach that aggregates per-token explanations while distinguishing between visually-grounded and purely linguistic tokens. Our evaluation on ImageNet, VQAv2, and PascalVOC shows a consistent improvement in both perturbation-based metrics, using a novel normalized perplexity measure, and segmentation-based metrics.
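Abstract (reference translation): As VLMs (Vision-Language Models) grow more sophisticated and more widely used, understanding their decision-making process becomes increasingly important. Conventional explainability methods, designed for classification tasks, struggle with modern autoregressive VLMs because of their complex token-by-token generation process and the intricate interactions between visual and textual modalities. The proposed DEX-AR (Dynamic Explainability for AutoRegressive models) is a novel explainability method that addresses these challenges by generating both per-token and sequence-level 2D heatmaps highlighting the image regions essential to the textual response. The method interprets autoregressive VLMs by computing layer-wise gradients with respect to attention maps during the token-by-token generation process. DEX-AR introduces two key innovations: a dynamic head filtering mechanism that identifies attention heads focused on visual information, and a sequence-level filtering approach that aggregates per-token explanations while distinguishing visually grounded tokens from purely linguistic ones. Evaluation on ImageNet, VQAv2, and PascalVOC shows a consistent improvement in both perturbation-based metrics, using a novel normalized perplexity measure, and segmentation-based metrics.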
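
The abstract names three ingredients: gradients with respect to attention maps, filtering of attention heads by their focus on visual information, and aggregation of per-token maps into a sequence-level map. The following is a minimal NumPy sketch of how such pieces could fit together; it is not the authors' implementation. The function names, the gradient-times-attention relevance, the `head_threshold` parameter, and the `visual_scores` weights are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def token_heatmap(attn, grad, image_mask, grid_hw, head_threshold=0.5):
    """Hypothetical per-token heatmap (illustrative, not the paper's formula).

    attn, grad : arrays of shape (layers, heads, seq_len) holding the generated
                 token's attention over the context and the gradient of its
                 score with respect to that attention.
    image_mask : boolean array of shape (seq_len,), True at image-patch positions.
    grid_hw    : (height, width) of the image-patch grid.
    """
    attn = np.asarray(attn, dtype=float)
    grad = np.asarray(grad, dtype=float)
    image_mask = np.asarray(image_mask, dtype=bool)

    # Assumed head filter: keep heads whose attention mass lies mostly on image patches.
    total_mass = np.clip(attn.sum(axis=-1), 1e-12, None)
    visual_share = attn[..., image_mask].sum(axis=-1) / total_mass  # (layers, heads)
    keep = visual_share >= head_threshold

    # Assumed relevance: positive part of gradient times attention, image positions only.
    relevance = np.maximum(grad * attn, 0.0)[..., image_mask]  # (layers, heads, patches)
    relevance = relevance * keep[..., None]

    # Sum over layers and heads, then lay the patches out as a 2D grid.
    return relevance.sum(axis=(0, 1)).reshape(grid_hw)


def sequence_heatmap(token_maps, visual_scores):
    """Hypothetical sequence-level map: weighted average of per-token maps.

    token_maps    : array of shape (tokens, height, width).
    visual_scores : array of shape (tokens,), assumed non-negative weights that
                    are near zero for purely linguistic tokens.
    """
    weights = np.asarray(visual_scores, dtype=float)
    weights = weights / max(weights.sum(), 1e-12)
    return np.tensordot(weights, np.asarray(token_maps, dtype=float), axes=1)
```

In this sketch a head that attends mostly to text positions contributes nothing to the heatmap, and a token with a zero `visual_scores` weight contributes nothing to the sequence-level map. How DEX-AR actually scores heads and tokens is described in the paper itself, not in the abstract.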

Related papers