Fugu-MT Paper Translation (Abstract): Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

Paper overview: Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

arxiv url: http://arxiv.org/abs/2603.24844v1
Date: Wed, 25 Mar 2026 22:20:25 GMT
Status: Translation complete
Last updated in system: 2026-03-27 20:52:48.005387
Title: Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models
Authors: Isha Puri, Mehul Damani, Idan Shenfeld, Marzyeh Ghassemi, Jacob Andreas, Yoon Kim
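Abstract summary: This paper describes a multi-answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers. Across question-answering, medical diagnostic, and coding benchmarks, the approach improves diversity, coverage, and set-level calibration scores compared to baselines trained to give a single answer.
Reference score (attention score computed independently by this site): 78.68818219506313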
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with confidence estimates for each one, and without computationally intensive repeated sampling to generate non-modal answers. This paper describes a multi-answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model's generative process. Across question-answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set-level calibration scores compared to single answer trained baselines. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling procedures such as best-of-k. Code and more information can be found at https://multi-answer-rl.github.io/.
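The abstract names diversity, coverage, and set-level calibration as evaluation axes but does not define them. As a rough illustration only, the sketch below scores one set of candidate answers with stated confidences against a set of reference answers. The function names, the string normalization, the exact-match criterion, and the use of a Brier-style score for calibration are all assumptions made here for illustration; they are not taken from the paper and may differ from its actual metrics.

```python
from typing import List, Set, Tuple

# One candidate = (answer text, stated confidence in [0, 1]).
# This representation is an assumption for illustration, not the paper's format.
Candidate = Tuple[str, float]


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def coverage(candidates: List[Candidate], gold: Set[str]) -> float:
    """Fraction of reference answers that appear among the candidates."""
    gold_norm = {normalize(g) for g in gold}
    if not gold_norm:
        return 0.0
    produced = {normalize(a) for a, _ in candidates}
    return len(produced & gold_norm) / len(gold_norm)


def diversity(candidates: List[Candidate]) -> float:
    """Fraction of candidates that are distinct after normalization."""
    if not candidates:
        return 0.0
    distinct = {normalize(a) for a, _ in candidates}
    return len(distinct) / len(candidates)


def brier_score(candidates: List[Candidate], gold: Set[str]) -> float:
    """Mean squared gap between stated confidence and 0/1 correctness.

    Lower is better. Used here as a stand-in for set-level calibration.
    """
    if not candidates:
        return 0.0
    gold_norm = {normalize(g) for g in gold}
    errors = [
        (conf - (1.0 if normalize(a) in gold_norm else 0.0)) ** 2
        for a, conf in candidates
    ]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Made-up example data: three candidates, two reference answers.
    cands = [("Influenza", 0.5), ("COVID-19", 0.25), ("influenza", 0.25)]
    refs = {"influenza", "strep throat"}
    print("coverage:", coverage(cands, refs))
    print("diversity:", diversity(cands))
    print("brier:", brier_score(cands, refs))
```

In this toy case the repeated "influenza" answer lowers diversity without raising coverage, which is the kind of mode collapse the abstract describes; a model that spreads its candidates across distinct plausible answers would score higher on both.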


Related papers