Fugu-MT Paper Translation (Summary): EIVE: End-to-End Instance-Specific Visual Explanations for Detection Transformers

Paper summary: EIVE: End-to-End Instance-Specific Visual Explanations for Detection Transformers

arxiv url: http://arxiv.org/abs/2606.01601v1
Date: Mon, 01 Jun 2026 02:56:31 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:29.884994
Title: EIVE: End-to-End Instance-Specific Visual Explanations for Detection Transformers
Title (reference translation): EIVE: End-to-End Instance-Specific Visual Explanations for Detection Transformers
Authors: Jianlin Xiang, Yanshan Li, Linhui Dai
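Abstract summary: We propose EIVE, an End-to-end Instance-specific Visual Explanation framework. EIVE directly generates instance-level saliency maps following the forward pass of Detection Transformer (DETR)-like models. Experiments on MS COCO 2017, ExDark, and Cityscapes show that EIVE produces high-quality instance-level saliency maps.
Reference score (independently computed attention score): 7.91708974258006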
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual explainability for object detection remains challenging due to the multi-instance nature of detection. Existing approaches predominantly adopt post-hoc paradigms, such as gradient-based or perturbation-based explanation methods, to interpret pretrained detectors. However, these methods require additional gradient computation or repeated model inference, resulting in limited efficiency. To address this issue, we propose an End-to-end Instance-specific Visual Explanation framework (EIVE) that directly generates instance-level saliency maps following the forward pass of Detection Transformer (DETR)-like models. Specifically, we reformulate the cross-attention mechanism in the decoder as an instance-level feature attribution pathway, so that the cross-attention of each object query corresponds to the visual attribution of its predicted instance. Based on this formulation, we design a cross-layer hybrid consensus fusion (CLHCF) module to aggregate cross-attention signals across decoder layers, producing stable and compact explanations. The explanation process of EIVE requires neither gradient computation nor input perturbation, yielding high computational efficiency, and applies to single- and multi-scale DETR-like object detectors. Finally, we present an attention-aware joint training strategy (AAJTS) as a training-oriented application, which imposes spatial constraints on cross-attention patterns to encourage stable and concentrated attribution representations, thereby improving both interpretability and detection performance. Experiments on MS COCO 2017, ExDark, and Cityscapes demonstrate that EIVE produces high-quality instance-level saliency maps and achieves performance comparable to, or better than, state-of-the-art post-hoc methods across standard metrics, while substantially improving explanation efficiency. Code is available at https://github.com/xjlDestiny/EIVE.git.
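Abstract (reference translation): Visual explainability for object detection remains difficult because detection is inherently multi-instance. Existing approaches mainly adopt post-hoc paradigms, such as gradient-based or perturbation-based explanation methods, to interpret pretrained detectors. These methods, however, need additional gradient computation or repeated model inference, which limits their efficiency. To address this, we propose an End-to-end Instance-specific Visual Explanation framework (EIVE) that directly generates instance-level saliency maps following the forward pass of Detection Transformer (DETR)-like models. Specifically, we reformulate the decoder's cross-attention mechanism as an instance-level feature attribution pathway, so that the cross-attention of each object query corresponds to the visual attribution of its predicted instance. Based on this formulation, we design a cross-layer hybrid consensus fusion (CLHCF) module that aggregates cross-attention signals across decoder layers and produces stable, compact explanations. The explanation process of EIVE requires neither gradient computation nor input perturbation, which yields high computational efficiency, and it applies to both single-scale and multi-scale DETR-like object detectors. Finally, we present an attention-aware joint training strategy (AAJTS) as a training-oriented application; it imposes spatial constraints on cross-attention patterns to encourage stable and concentrated attribution representations, improving both interpretability and detection performance. Experiments on MS COCO 2017, ExDark, and Cityscapes show that EIVE produces high-quality instance-level saliency maps and achieves performance comparable to, or better than, state-of-the-art post-hoc methods across standard metrics, while substantially improving explanation efficiency. Code is available at https://github.com/xjlDestiny/EIVE.git.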
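The abstract describes reading each object query's decoder cross-attention as that instance's attribution and fusing it across decoder layers. The sketch below illustrates only that general idea. The fusion rule (a blend of the layer-wise mean and the layer-wise minimum), the `alpha` parameter, and the names `fuse_layers` and `to_saliency_map` are all assumptions made for illustration; the abstract does not specify how CLHCF works, and this is not the authors' implementation.

```python
from typing import List

# Cross-attention weights indexed as [layer][query][token].
Attn = List[List[List[float]]]


def fuse_layers(attn: Attn, query: int, alpha: float = 0.5) -> List[float]:
    """Blend the layer-wise mean with the layer-wise minimum for one query.

    Hypothetical stand-in for a cross-layer fusion rule; NOT the paper's CLHCF.
    """
    per_layer = [layer[query] for layer in attn]
    n_tokens = len(per_layer[0])
    fused = []
    for t in range(n_tokens):
        values = [weights[t] for weights in per_layer]
        mean = sum(values) / len(values)
        consensus = min(values)  # high only if every layer attends to this token
        fused.append(alpha * mean + (1.0 - alpha) * consensus)
    return fused


def to_saliency_map(fused: List[float], height: int, width: int) -> List[List[float]]:
    """Min-max normalise to [0, 1] and reshape flat tokens to height x width."""
    if len(fused) != height * width:
        raise ValueError("token count must equal height * width")
    lo, hi = min(fused), max(fused)
    scale = (hi - lo) or 1.0  # avoid division by zero on a flat map
    norm = [(v - lo) / scale for v in fused]
    return [norm[r * width:(r + 1) * width] for r in range(height)]


# Toy input (made up): 2 decoder layers, 1 object query, 2x2 feature map.
attn = [
    [[0.1, 0.6, 0.2, 0.1]],
    [[0.2, 0.5, 0.1, 0.2]],
]
saliency = to_saliency_map(fuse_layers(attn, query=0), height=2, width=2)
```

In this toy case both layers attend most strongly to the second token, so it receives the highest saliency after normalisation. No gradient computation or input perturbation is involved, which matches the efficiency argument made in the abstract.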

Related papers