Fugu-MT paper translation (abstract): Decodable but Not Faithful: Coupling Natural-Language Rationales to Programmatic Verifiers

Paper abstract: Decodable but Not Faithful: Coupling Natural-Language Rationales to Programmatic Verifiers

arxiv url: http://arxiv.org/abs/2606.21678v1
Date: Fri, 19 Jun 2026 18:37:24 GMT
Status: Translation complete
Last updated in system: 2026-06-26 04:04:08.945366
Title: Decodable but Not Faithful: Coupling Natural-Language Rationales to Programmatic Verifiers
Authors: Vatsal Ananthula, Adarsh Kumarappan
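Abstract summary: Language models can generate plausible rationales for their predictions, but these explanations may not faithfully represent the model's internal reasoning. The paper proposes verifier-coupled reasoning, a framework that inserts inline claims into reasoning traces and trains an auxiliary consistency head to predict programmatic verifier outputs. Consistency training makes verifier information decodable from rationale representations, but decodability does not guarantee faithful generation.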
Reference score (independently computed attention score): 0.7212939068975618
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language models can generate plausible rationales for their predictions, but these explanations may not faithfully represent the model's internal reasoning. We propose verifier-coupled reasoning, a framework that inserts inline claims into reasoning traces and trains an auxiliary consistency head to predict programmatic verifier outputs from rationale-span hidden states. The central finding is a gap between decodability and faithfulness: consistency training reliably makes verifier information decodable from rationale representations, but decodability does not guarantee faithful generation. In LeanCheck (formal theorem proving), rationale-only and proof-only pooling achieve perfect directional separation under counterfactual conflict. In KataGo (Go engine), commentary spans encode 10-way win-rate buckets at 81% accuracy. Yet in a code setting, the model achieves 98.6% coupling while its generated explanations remain unfaithful: fluent prose with correct structured claims, but describing unrelated algorithms; a controlled pretrained-vs-from-scratch comparison shows the gap is not capacity-driven. Synthetic activation patching confirms causal influence (73-89% vs. 31% baseline), FEVER reveals that evidence-only pooling isolates genuine evidence sensitivity at the cost of raw accuracy, and per-claim analysis shows that consistency loss disproportionately benefits fine-grained claims over binary ones. These results establish that consistency losses are effective diagnostics and representation-shaping tools, but not sufficient conditions for faithful reasoning.
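The abstract does not spell out how the auxiliary consistency head is built, so the following is only a minimal sketch of the general idea: pool the hidden states over the rationale span, pass the pooled vector through a small head, and penalize disagreement with the programmatic verifier's output. Mean pooling, a single linear layer, cross-entropy loss, and equal-width win-rate buckets are all assumptions here, and every function name is hypothetical rather than taken from the paper.

```python
import math

def mean_pool(hidden_states, span_mask):
    """Average the hidden vectors at positions where span_mask is 1 (assumed pooling)."""
    selected = [h for h, m in zip(hidden_states, span_mask) if m]
    if not selected:
        raise ValueError("span_mask selects no tokens")
    dim = len(selected[0])
    return [sum(h[i] for h in selected) / len(selected) for i in range(dim)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def consistency_head(pooled, weights, bias):
    """Hypothetical linear head: one logit per verifier outcome class."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]

def consistency_loss(hidden_states, span_mask, weights, bias, verifier_label):
    """Cross-entropy between the head's prediction and the verifier's output."""
    pooled = mean_pool(hidden_states, span_mask)
    probs = softmax(consistency_head(pooled, weights, bias))
    return -math.log(probs[verifier_label])

def win_rate_bucket(win_rate, n_buckets=10):
    """Map a win rate in [0, 1] to a class index (equal-width buckets are assumed)."""
    if not 0.0 <= win_rate <= 1.0:
        raise ValueError("win_rate must lie in [0, 1]")
    return min(int(win_rate * n_buckets), n_buckets - 1)

# Toy example: four tokens with 2-dimensional hidden states;
# tokens 1 and 2 form the rationale span.
hidden = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
rationale_mask = [0, 1, 1, 0]
weights = [[1.0, 0.0], [0.0, 1.0]]  # two verifier classes, e.g. fail / pass
bias = [0.0, 0.0]
loss = consistency_loss(hidden, rationale_mask, weights, bias, verifier_label=1)
print(f"consistency loss: {loss:.4f}")
```

A low loss in a sketch like this only shows that the verifier's output can be read off the pooled representation. As the abstract stresses, that kind of decodability does not by itself mean the generated rationale is faithful.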

Related papers