Fugu-MT paper translation (abstract): DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization

Paper abstract: DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization

arxiv url: http://arxiv.org/abs/2508.14460v1
Date: Wed, 20 Aug 2025 06:31:18 GMT
Status: translation complete
Last updated in system: 2025-08-21 16:52:41.359888
Title: DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization
Title (reference translation): DuPO: Realizing Reliable LLM Self-Verification via Dual Preference Optimization
Authors: Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang
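Abstract summary: We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback. Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs its dual task to reconstruct the unknown part. It improves average translation quality by 2.13 COMET over 756 directions, raises mathematical reasoning accuracy by an average of 6.4 points on three benchmarks, and improves performance by 9.3 points as an inference-time reranker.
Reference score (attention score computed independently by the site): 47.32314866162273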
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)'s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs' ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it enhances the average translation quality by 2.13 COMET over 756 directions, boosts the mathematical reasoning accuracy by an average of 6.4 points on three challenge benchmarks, and enhances performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
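Abstract (reference translation): We propose DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback through a generalized duality. DuPO addresses two key limitations: the reliance of Reinforcement Learning with Verifiable Rewards (RLVR) on costly labels and its applicability only to verifiable tasks, and the restriction of traditional dual learning to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes the primal task's input into known and unknown components, then constructs a dual task that reconstructs the unknown part from the primal output and the known information (e.g., reversing a math solution to recover hidden variables), which extends applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward for optimizing the primal task, and works well with an LLM's ability to instantiate both tasks in a single model. Empirically, DuPO improves average translation quality by 2.13 COMET over 756 directions, raises mathematical reasoning accuracy by an average of 6.4 points on three benchmarks, and improves performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.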
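
The known/unknown decomposition and the reconstruction-based reward described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the primal task here is simply "add a known and a hidden number", a deterministic function stands in for the LLM that would perform the dual task, and every name (`Problem`, `dual_reconstruct`, `dual_reward`, `rerank`, `preference_pair`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    known: int    # input component that stays visible to the dual task
    unknown: int  # input component the dual task must reconstruct


def dual_reconstruct(candidate_answer: int, known: int) -> int:
    """Toy dual task: recover the hidden addend from a candidate sum.

    Assumption: in the paper an LLM performs this step; a closed-form
    inverse stands in for it here.
    """
    return candidate_answer - known


def dual_reward(problem: Problem, candidate_answer: int) -> float:
    """Self-supervised reward: negative reconstruction error (0.0 is best)."""
    reconstructed = dual_reconstruct(candidate_answer, problem.known)
    return float(-abs(reconstructed - problem.unknown))


def rerank(problem: Problem, candidates: list[int]) -> int:
    """Inference-time reranking: keep the candidate with the highest reward."""
    return max(candidates, key=lambda c: dual_reward(problem, c))


def preference_pair(problem: Problem, candidates: list[int]) -> tuple[int, int]:
    """Build a (chosen, rejected) pair from the best and worst candidates."""
    ranked = sorted(candidates, key=lambda c: dual_reward(problem, c))
    return ranked[-1], ranked[0]


problem = Problem(known=17, unknown=25)
candidates = [40, 42, 45]  # hypothetical sampled answers to "17 + 25"

best = rerank(problem, candidates)
chosen, rejected = preference_pair(problem, candidates)
print(best, chosen, rejected)  # prints: 42 42 45
```

No reference answer is consulted anywhere: the reward depends only on how well the hidden part of the input can be recovered from a candidate output, which is the sense in which the feedback is annotation-free. The `(chosen, rejected)` pair is the kind of signal a preference optimization step could consume.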


Related papers