Fugu-MT paper translation (abstract): A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs

Paper abstract: A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs

arxiv url: http://arxiv.org/abs/2606.12160v2
Date: Thu, 11 Jun 2026 13:44:29 GMT
Status: Translation complete
System update date: 2026-06-12 13:39:59.688164
Title: A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs
Title (reference translation): A Controlled Study of Decoding-Time Truthfulness Methods for Instruction-Tuned LLMs
Authors: Ao Sun
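Abstract summary: Layer-contrast decoding, inference-time intervention, and learned logit adapters have shown gains of 10-30 points on TruthfulQA. Modern instruction-tuned LLMs already achieve substantially higher baselines. Deliberative prompting methods appear more robust in the evaluated regime.
Reference score (independently computed attention score): 3.4007995136788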
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Decoding-time truthfulness methods -- layer-contrast decoding, inference-time intervention, and learned logit adapters -- have demonstrated 10-30 point gains on TruthfulQA when applied to base language models. However, modern instruction-tuned LLMs already achieve substantially higher baselines (61-76%), raising the question of whether these methods remain effective in practice. We design a six-control evaluation framework -- out-of-distribution training, multi-judge validation, simple decoding baselines, confound controls, bootstrap confidence intervals, and seed variance -- and apply it across 5 models (1B-70B), 3 benchmarks, and 15 methods. We find that previously reported gains shrink substantially under strict controls: on the full TruthfulQA benchmark (N=817), no token-level method achieves a statistically significant improvement, and the best learned adapter scores 2.0 points below greedy decoding (p = 0.23). We identify five evaluation sensitivities -- contamination, judge choice, missing baselines, confounds, and statistical noise -- that individually or jointly account for these discrepancies. Cross-benchmark validation on HaluEval QA and TriviaQA confirms that these patterns extend beyond TruthfulQA. Deliberative prompting methods (chain-of-thought, self-critique) appear more robust in the evaluated regime, with CoT achieving gains of +5.6 to +19 pp across benchmarks as a training-free, single-pass method. We release a seven-point evaluation checklist and discuss implications for future truthfulness research.
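Abstract (reference translation): Decoding-time truthfulness methods (layer-contrast decoding, inference-time intervention, and learned logit adapters) have shown gains of 10-30 points on TruthfulQA when applied to base language models. However, modern instruction-tuned LLMs already reach considerably higher baselines (61-76%), which raises the question of whether these methods are still effective in practice. The authors design an evaluation framework with six controls (out-of-distribution training, multi-judge validation, simple decoding baselines, confound controls, bootstrap confidence intervals, and seed variance) and apply it to 5 models (1B-70B), 3 benchmarks, and 15 methods. Previously reported gains shrink substantially under strict controls: on the full TruthfulQA benchmark (N=817), no token-level method yields a statistically significant improvement, and the best learned adapter scores 2.0 points below greedy decoding (p = 0.23). Five evaluation sensitivities (contamination, judge choice, missing baselines, confounds, and statistical noise) account for these discrepancies, individually or jointly. Cross-benchmark validation on HaluEval QA and TriviaQA confirms that these patterns extend beyond TruthfulQA. Deliberative prompting methods (chain-of-thought, self-critique) appear more robust in the evaluated regime, with CoT achieving gains of +5.6 to +19 pp across benchmarks as a training-free, single-pass method. The authors release a seven-point evaluation checklist and discuss implications for future truthfulness research.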
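The abstract lists bootstrap confidence intervals among its controls but does not spell out the procedure. The following is a minimal sketch of one common variant, a paired percentile bootstrap over per-question 0/1 correctness; it is an assumption about how such a check could look, not the authors' code. The function name `paired_bootstrap_ci`, the resample count, and the synthetic scores are all hypothetical; only the sample size of 817 comes from the abstract.

```python
import random


def paired_bootstrap_ci(baseline, method, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the accuracy difference (method - baseline).

    baseline, method: equal-length lists of 0/1 per-question correctness,
    aligned so that index i refers to the same question in both lists.
    Returns (observed_difference, ci_lower, ci_upper) as fractions.
    """
    if len(baseline) != len(method):
        raise ValueError("score lists must be aligned per question")
    n = len(baseline)
    rng = random.Random(seed)

    # Work on per-question differences so each resample keeps the pairing.
    diffs = [m - b for b, m in zip(baseline, method)]
    observed = sum(diffs) / n

    resampled = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # draw n questions with replacement
        resampled.append(sum(sample) / n)
    resampled.sort()

    lower = resampled[round((alpha / 2) * n_resamples)]
    upper = resampled[round((1 - alpha / 2) * n_resamples) - 1]
    return observed, lower, upper


# Synthetic illustration only: these scores are randomly generated and are
# NOT results from the paper. 817 matches the TruthfulQA size in the abstract.
data_rng = random.Random(42)
greedy_scores = [1 if data_rng.random() < 0.70 else 0 for _ in range(817)]
adapter_scores = [1 if data_rng.random() < 0.70 else 0 for _ in range(817)]

delta, ci_low, ci_high = paired_bootstrap_ci(greedy_scores, adapter_scores)
print(f"difference = {100 * delta:+.1f} pp, "
      f"95% CI = [{100 * ci_low:+.1f}, {100 * ci_high:+.1f}] pp")
```

Under this reading, a method would count as a significant improvement only if the whole interval lies above zero. Resampling the per-question differences, rather than each method's scores separately, keeps the pairing between methods on the same question, which usually tightens the interval.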

Related papers