Fugu-MT paper translation (abstract): Continuous-Utility Direct Preference Optimization

Paper summary: Continuous-Utility Direct Preference Optimization

arxiv url: http://arxiv.org/abs/2602.00931v1
Date: Sat, 31 Jan 2026 23:15:32 GMT
Status: translation complete
Last updated in system: 2026-02-03 19:28:33.472097
Title: Continuous-Utility Direct Preference Optimization
Title (reference translation): Continuous-Utility Direct Preference Optimization
Authors: Muhammad Ahmed Mohsin, Muhammad Umer, Ahsan Bilal, Zihao He, Muhammad Usman Rafique, Asad Aali, Muhammad Ali Jamshed, John M. Cioffi, Emily Fox
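Abstract summary: We introduce Continuous Utility Direct Preference Optimization (CU-DPO), a framework that aligns models to a portfolio of prompt-based cognitive strategies. We prove that learning with K strategies yields a Theta(K log K) improvement in sample complexity over binary preferences. We show that CU-DPO improves strategy selection accuracy from 35-46 percent to 68-78 percent across seven base models.
Reference score (independently computed attention measure): 14.867957084669497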
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language model reasoning is often treated as a monolithic capability, relying on binary preference supervision that fails to capture partial progress or fine-grained reasoning quality. We introduce Continuous Utility Direct Preference Optimization (CU-DPO), a framework that aligns models to a portfolio of prompt-based cognitive strategies by replacing binary labels with continuous scores that capture fine-grained reasoning quality. We prove that learning with K strategies yields a Theta(K log K) improvement in sample complexity over binary preferences, and that DPO converges to the entropy-regularized utility-maximizing policy. To exploit this signal, we propose a two-stage training pipeline: (i) strategy selection, which optimizes the model to choose the best strategy for a given problem via best-vs-all comparisons, and (ii) execution refinement, which trains the model to correctly execute the selected strategy using margin-stratified pairs. On mathematical reasoning benchmarks, CU-DPO improves strategy selection accuracy from 35-46 percent to 68-78 percent across seven base models, yielding consistent downstream reasoning gains of up to 6.6 points on in-distribution datasets with effective transfer to out-of-distribution tasks.
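Abstract (reference translation): Reasoning in large language models is often treated as a monolithic capability and relies on binary preference supervision, which fails to capture partial progress or fine-grained reasoning quality. We introduce Continuous Utility Direct Preference Optimization (CU-DPO), a framework that aligns models to a portfolio of prompt-based cognitive strategies by replacing binary labels with continuous scores that capture fine-grained reasoning quality. We prove that learning with K strategies yields a Theta(K log K) improvement in sample complexity over binary preferences, and that DPO converges to the entropy-regularized utility-maximizing policy. To exploit this signal, we propose a two-stage training pipeline: (i) strategy selection, which optimizes the model to choose the best strategy for a given problem via best-vs-all comparisons, and (ii) execution refinement, which trains the model to correctly execute the selected strategy using margin-stratified pairs. On mathematical reasoning benchmarks, CU-DPO improves strategy selection accuracy from 35-46 percent to 68-78 percent across seven base models, yielding consistent downstream reasoning gains of up to 6.6 points on in-distribution datasets, with effective transfer to out-of-distribution tasks.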
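
The two-stage pipeline described in the abstract relies on two kinds of preference pairs built from continuous scores: best-vs-all comparisons for strategy selection and margin-stratified pairs for execution refinement. The Python below is a minimal sketch of how such pairs might be constructed, not the authors' implementation. The function names, the strategy names, the bin edges, and the data layout are all assumptions made for illustration; the abstract does not say how scores are computed or how margins are stratified. The `dpo_loss` function is the standard per-pair DPO objective, included only to show where the pairs would be consumed.

```python
import math
from itertools import combinations


def best_vs_all_pairs(utilities):
    """Pair the highest-utility strategy against every other strategy.

    `utilities` maps a strategy name to a continuous score (assumed layout).
    Returns a list of (chosen, rejected, margin) tuples.
    """
    best = max(utilities, key=utilities.get)
    return [
        (best, other, utilities[best] - score)
        for other, score in utilities.items()
        if other != best
    ]


def margin_stratified_pairs(scored, edges=(0.1, 0.3)):
    """Bucket every (chosen, rejected) pair by its utility margin.

    `scored` is a list of (response_id, utility) tuples.
    `edges` are hypothetical bin boundaries; bucket 0 holds the smallest margins.
    """
    buckets = {i: [] for i in range(len(edges) + 1)}
    for (a, ua), (b, ub) in combinations(scored, 2):
        if ua == ub:
            continue  # a tie carries no preference signal
        chosen, rejected = (a, b) if ua > ub else (b, a)
        margin = abs(ua - ub)
        bucket = sum(margin >= e for e in edges)  # count of edges passed
        buckets[bucket].append((chosen, rejected, margin))
    return buckets


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, from sequence log-probabilities."""
    z = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log(sigmoid(z)) written as a numerically stable softplus(-z)
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Illustrative usage with made-up strategy names and scores.
strategy_scores = {"direct": 0.2, "decompose": 0.9, "work_backward": 0.5}
selection_pairs = best_vs_all_pairs(strategy_scores)
execution_buckets = margin_stratified_pairs(
    [("resp_a", 0.9), ("resp_b", 0.85), ("resp_c", 0.2)]
)
```

In this sketch the margin is kept alongside each pair so that a training loop could sample evenly across buckets or weight pairs by margin; which of those the paper does is not stated in the abstract.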

Related papers