Fugu-MT Paper Translation (Summary): Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

Paper summary: Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

arxiv url: http://arxiv.org/abs/2604.04500v1
Date: Mon, 06 Apr 2026 07:51:59 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:19.13806
Title: Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
Title (reference translation): Saliency-R1: Achieving Interpretable and Faithful Vision-Language Reasoning via Saliency-map Alignment Reward
Authors: Shizhan Gong, Minda Hu, Qiyuan Zhang, Chen Ma, Qi Dou
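Abstract summary: The authors propose Saliency-R1, a framework for improving the interpretability and faithfulness of vision-language models (VLMs). The paper introduces a novel saliency map technique that efficiently highlights the critical image regions contributing to generated tokens without additional computational overhead. In experiments, Saliency-R1 improves reasoning faithfulness, interpretability, and overall task performance.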
Reference score (independently computed attention score): 26.150136674969605
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-language models (VLMs) have achieved remarkable success across diverse tasks. However, concerns about their trustworthiness persist, particularly regarding their tendency to lean more on textual cues than on visual evidence and the risk of producing ungrounded or fabricated responses. To address these issues, we propose Saliency-R1, a framework for improving the interpretability and faithfulness of VLM reasoning. Specifically, we introduce a novel saliency map technique that efficiently highlights critical image regions contributing to generated tokens without additional computational overhead. This can further be extended to trace how visual information flows through the reasoning process to the final answers, revealing the alignment between the thinking process and the visual context. We use the overlap between the saliency maps and human-annotated bounding boxes as the reward function, and apply Group Relative Policy Optimization (GRPO) to align the salient parts and critical regions, encouraging models to focus on relevant areas when conducting reasoning. Experiments show Saliency-R1 improves reasoning faithfulness, interpretability, and overall task performance.
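Abstract (reference translation): Vision-language models (VLMs) have achieved remarkable success across a variety of tasks. However, concerns about their trustworthiness remain, in particular their tendency to rely on textual cues more than on visual evidence and the risk of generating ungrounded or fabricated responses. To address these problems, the authors propose Saliency-R1, a framework for improving the interpretability and faithfulness of VLM reasoning. Specifically, they introduce a novel saliency map technique that efficiently highlights the critical image regions contributing to generated tokens without additional computational overhead. The technique can be extended further to trace how visual information flows through the reasoning process to the final answer, revealing how well the thinking process aligns with the visual context. The overlap between the saliency maps and human-annotated bounding boxes serves as the reward function, and Group Relative Policy Optimization (GRPO) is applied to align the salient parts with the critical regions, encouraging the model to focus on relevant areas while reasoning. In experiments, Saliency-R1 improves reasoning faithfulness, interpretability, and overall task performance.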
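The abstract describes the reward only at a high level (overlap between a saliency map and human-annotated bounding boxes, optimized with GRPO). The following is a minimal sketch of that idea, not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names `saliency_box_overlap` and `group_relative_advantages`, the choice of "fraction of saliency mass inside the box" as the overlap measure, the half-open box convention, and the use of the population standard deviation when standardizing rewards within a group.

```python
from statistics import mean, pstdev


def saliency_box_overlap(saliency, box):
    """Fraction of total saliency mass that falls inside the box.

    saliency: 2D list of non-negative floats (rows x cols).
    box: (row0, col0, row1, col1), half-open, in grid coordinates.
    This overlap measure is an assumption; the paper may define it differently.
    """
    r0, c0, r1, c1 = box
    total = sum(sum(row) for row in saliency)
    if total <= 0:
        return 0.0  # no saliency mass at all: give zero reward
    inside = sum(sum(row[c0:c1]) for row in saliency[r0:r1])
    return inside / total


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two hypothetical sampled responses for the same image and question:
# one whose saliency sits entirely inside the annotated box, one that is diffuse.
focused = [[0.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 0.0]]
diffuse = [[1.0] * 3 for _ in range(3)]
box = (1, 1, 2, 2)  # covers only the center cell

rewards = [saliency_box_overlap(m, box) for m in (focused, diffuse)]
advantages = group_relative_advantages(rewards)
```

In this toy group the focused map receives a positive advantage and the diffuse map a negative one, which is the direction of pressure the abstract describes: responses whose saliency concentrates on the annotated region are favored over those whose saliency is spread elsewhere.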


Related papers