Fugu-MT Paper Translation (Abstract): From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models

Paper overview: From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models

arxiv url: http://arxiv.org/abs/2510.05095v1
Date: Mon, 06 Oct 2025 17:58:01 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:53:00.048275
Title: From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models
Title (reference translation): From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models
Authors: Mingkang Zhu, Xi Chen, Bei Yu, Hengshuang Zhao, Jiaya Jia
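Abstract summary: Large reasoning models (LRMs) generate intermediate reasoning traces before producing final answers. Aligning LRMs with human preferences is a crucial prerequisite for model deployment, yet it remains underexplored. A common workaround optimizes a single sampled trajectory, which introduces substantial gradient variance from trace sampling.
Reference score (independently computed attention score): 90.45197506653341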
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large reasoning models (LRMs) generate intermediate reasoning traces before producing final answers, yielding strong gains on multi-step and mathematical tasks. Yet aligning LRMs with human preferences, a crucial prerequisite for model deployment, remains underexplored. The statistically correct objective for preference alignment requires marginalizing over reasoning traces, but this computation is intractable in practice. A common workaround optimizes a single sampled trajectory, which introduces substantial gradient variance from stochastic trace sampling. To address this challenge, we frame preference optimization for LRMs through the lens of the bias-variance trade-off and propose Bias-Variance Optimized Preference Optimization (BVPO), a simple, drop-in method that mixes two gradient estimators: a high-variance trace-based estimator and a low-variance empty-trace estimator obtained by disabling reasoning trace generation. Our theory shows that BVPO strictly reduces trace-induced variance for any nontrivial mixture, provides a closed-form choice of the mixing weight that minimizes mean-squared error relative to the true marginal gradient, and under standard smoothness and step-size conditions, tightens classical convergence bounds for stochastic gradient descent. Empirically, BVPO improves alignment over the best baseline by up to 7.8 points on AlpacaEval 2 and 6.8 points on Arena-Hard. Despite being trained only on general conversational data, BVPO also boosts reasoning performance for base models by up to 4.0 points on the average of six math reasoning benchmarks. These results identify variance from trace sampling as a key bottleneck and demonstrate that directly optimizing the bias-variance trade-off yields more stable training and stronger overall performance.
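Abstract (reference translation): Large reasoning models (LRMs) produce intermediate reasoning traces before giving a final answer, which brings strong gains on multi-step and mathematical tasks. However, aligning LRMs with human preferences, an important prerequisite for model deployment, remains underexplored. The statistically correct objective for preference alignment requires marginalizing over reasoning traces, but in practice this computation is intractable. A common workaround optimizes a single sampled trajectory, which introduces considerable gradient variance from stochastic trace sampling. To address this problem, we view preference optimization for LRMs through the lens of the bias-variance trade-off and propose Bias-Variance Optimized Preference Optimization (BVPO), a simple drop-in method that mixes two gradient estimators: a high-variance trace-based estimator and a low-variance empty-trace estimator. Our theory shows that BVPO strictly reduces trace-induced variance for any nontrivial mixture, gives a closed-form choice of the mixing weight that minimizes mean-squared error with respect to the true marginal gradient, and, under standard smoothness and step-size conditions, tightens the classical convergence bounds for stochastic gradient descent. Empirically, BVPO improves alignment over the best baseline by up to 7.8 points on AlpacaEval 2 and 6.8 points on Arena-Hard. Although it is trained only on general conversational data, BVPO also raises the reasoning performance of base models by up to 4.0 points on the average of six math reasoning benchmarks. These results identify variance from trace sampling as a key bottleneck and show that directly optimizing the bias-variance trade-off leads to more stable training and better overall performance.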
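The abstract describes mixing a high-variance trace-based gradient estimator with a low-variance empty-trace estimator, but this page does not state the closed-form mixing weight. The following is only a minimal scalar sketch of the general idea, under assumptions that are not taken from the paper: the trace-based estimator is treated as unbiased, the empty-trace estimator has a fixed bias, and the two are uncorrelated. Under those assumptions the textbook MSE-minimizing weight on the trace-based estimator is (var_empty + bias_empty^2) / (var_trace + var_empty + bias_empty^2). All function names and numeric values are hypothetical.

```python
import random


def optimal_mixing_weight(var_trace, var_empty, bias_empty):
    """MSE-minimizing weight on the trace-based estimator.

    Assumption for this sketch only: the trace-based estimator is unbiased,
    the empty-trace estimator has bias `bias_empty`, and the two estimators
    are uncorrelated. This is not the paper's stated formula.
    """
    mse_empty = var_empty + bias_empty ** 2
    return mse_empty / (var_trace + mse_empty)


def mix(g_trace, g_empty, alpha):
    """Convex combination of the two gradient estimates."""
    return alpha * g_trace + (1.0 - alpha) * g_empty


def simulate(alpha, true_grad=1.0, sd_trace=2.0, sd_empty=0.3,
             bias_empty=0.4, n=20000, seed=0):
    """Empirical variance and MSE of the mixed estimator on a scalar toy problem."""
    rng = random.Random(seed)
    samples = [
        mix(rng.gauss(true_grad, sd_trace),               # noisy, unbiased
            rng.gauss(true_grad + bias_empty, sd_empty),  # stable, biased
            alpha)
        for _ in range(n)
    ]
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / n
    mse = sum((s - true_grad) ** 2 for s in samples) / n
    return variance, mse


# Hypothetical toy values, chosen only for illustration.
alpha_star = optimal_mixing_weight(var_trace=2.0 ** 2,
                                   var_empty=0.3 ** 2,
                                   bias_empty=0.4)
var_trace_only, mse_trace_only = simulate(alpha=1.0)
var_mixed, mse_mixed = simulate(alpha=alpha_star)

print(f"alpha* = {alpha_star:.4f}")
print(f"trace-only: var={var_trace_only:.3f}, mse={mse_trace_only:.3f}")
print(f"mixed:      var={var_mixed:.3f}, mse={mse_mixed:.3f}")
```

In this toy setting, setting `alpha=1.0` recovers the single-trajectory workaround, while any `alpha` strictly between 0 and 1 trades some bias for lower variance. In a real training loop the two estimates would be gradient tensors computed with and without reasoning trace generation, and the variances and bias would have to be estimated rather than assumed.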