Fugu-MT Paper Translation (Abstract): Think Through Uncertainty: Improving Long-Form Generation Factuality via Reasoning Calibration

Paper overview: Think Through Uncertainty: Improving Long-Form Generation Factuality via Reasoning Calibration

arxiv url: http://arxiv.org/abs/2604.12046v1
Date: Mon, 13 Apr 2026 20:38:36 GMT
Status: Translation complete
Last updated in system: 2026-04-15 19:11:32.116869
Title: Think Through Uncertainty: Improving Long-Form Generation Factuality via Reasoning Calibration
Authors: Xin Liu, Lu Wang
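Abstract summary: Large language models (LLMs) often hallucinate in long-form generation. Existing approaches mainly improve factuality through post-hoc revision or reinforcement learning. This paper proposes CURE, a framework that improves long-form factuality by teaching LLMs to reason about uncertainty at the claim level.
Reference score (independently computed attention score): 7.51755942515969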
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) often hallucinate in long-form generation. Existing approaches mainly improve factuality through post-hoc revision or reinforcement learning (RL) with correctness-based rewards, but they do not teach the model to estimate which parts of its generation are reliable. As a result, models may still state incorrect claims confidently in their responses. Recent advances in reasoning have significantly improved LLM performance, and have been leveraged to estimate confidence by incorporating calibration into RL objectives. However, existing approaches remain limited to a single scalar confidence for the entire response, which is insufficient for long-form generation where uncertainty varies across individual claims. To mitigate this problem, we propose CURE, a framework that improves long-form factuality by teaching LLMs to reason about uncertainty at the claim level. We first introduce a Claim-Aware Reasoning Protocol, which structures outputs into atomic claims paired with explicit confidence estimates. We then develop a multi-stage training pipeline that aligns model confidence with claims' correctness and then optimizes on factuality. The resulting calibrated confidence further enables selective prediction, allowing the model to abstain from uncertain claims at inference time. Experiments on four long-form factuality benchmarks show that CURE consistently improves factual accuracy over competitive supervised and RL baselines, while maintaining factual recall. In particular, it improves claim-level accuracy by up to 39.9% on Biography generation. These gains are accompanied by improved calibration, as reflected by a 16.0% increase in AUROC on FactBench.
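The abstract describes pairing each atomic claim with a confidence estimate, abstaining from uncertain claims at inference time, and measuring calibration with AUROC. The following is a minimal Python sketch of those two ideas only, not the paper's implementation. The `Claim` class, the `select_claims` and `auroc` helpers, the 0.7 threshold, and the toy confidences and correctness labels are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str          # one atomic claim (placeholder text here)
    confidence: float  # model-stated confidence in [0, 1]


def select_claims(claims, threshold=0.5):
    """Selective prediction: keep claims at or above the threshold, abstain from the rest."""
    return [c for c in claims if c.confidence >= threshold]


def auroc(confidences, labels):
    """Probability that a correct claim is ranked above an incorrect one (ties count 0.5)."""
    pos = [s for s, y in zip(confidences, labels) if y]
    neg = [s for s, y in zip(confidences, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect claim")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (assumed): four claims with made-up confidences and correctness labels.
claims = [
    Claim("Claim A", 0.90),
    Claim("Claim B", 0.80),
    Claim("Claim C", 0.30),
    Claim("Claim D", 0.85),
]
labels = [True, True, False, False]

kept = select_claims(claims, threshold=0.7)
score = auroc([c.confidence for c in claims], labels)

print([c.text for c in kept])  # claims the model does not abstain from
print(score)                   # pairwise-ranking AUROC on the toy data
```

In this toy setup the overconfident incorrect claim (Claim D) survives the threshold, which is the failure mode that better claim-level calibration is meant to reduce. Raising the threshold trades recall for accuracy.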


Related papers