Fugu-MT Paper Translation (Abstract): Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance

Paper overview: Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance

arxiv url: http://arxiv.org/abs/2604.14325v1
Date: Wed, 15 Apr 2026 18:32:32 GMT
Status: Translation complete
Last updated in system: 2026-04-17 21:29:29.988181
Title: Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance
Authors: Bar Alon, Itamar Zimerman, Lior Wolf
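Abstract summary: Large language models (LLMs) achieve strong performance and have revolutionized NLP. Their lack of explainability keeps them treated as black boxes, limiting their use in domains that demand transparency and trust. This work proposes a training-free method that enhances faithfulness by guiding explanation generation through attention-level interventions.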
Reference score (independently computed noteworthiness): 57.17102098930037
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) achieve strong performance and have revolutionized NLP, but their lack of explainability keeps them treated as black boxes, limiting their use in domains that demand transparency and trust. A promising direction to address this issue is post-hoc text-based explanations, which aim to explain model decisions in natural language. Prior work has focused on generating convincing rationales that appear to be subjectively faithful, but it remains unclear whether these explanations are epistemically faithful, that is, whether they reflect the internal evidence the model actually relied on for its decision. In this paper, we first assess the epistemic faithfulness of LLM-generated explanations via counterfactuals and show that they are often unfaithful. We then introduce a training-free method that enhances faithfulness by guiding explanation generation through attention-level interventions, informed by token-level heatmaps extracted via a faithful attribution method. This method significantly improves epistemic faithfulness across multiple models, benchmarks, and prompts.

Related papers