Fugu-MT paper translation (abstract): TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization

Paper overview: TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization

arxiv url: http://arxiv.org/abs/2605.00224v1
Date: Thu, 30 Apr 2026 20:59:49 GMT
Status: translation complete
Last updated in system: 2026-05-04 17:43:28.750262
Title: TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization
Title (reference translation): TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization
Authors: Abdulhady Abas Abdullah, Fatemeh Daneshfar, Seyedali Mirjalili, Mourad Oussalah
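Abstract summary: The proposed TUR-DPO is a topology- and uncertainty-aware variant of DPO that rewards how answers are derived. A small learnable reward is factorized over its signals (semantic faithfulness, utility, and topology quality) and incorporated into an uncertainty-weighted DPO objective that remains RL-free. Empirically, across open 7-8B models and benchmarks spanning mathematical reasoning, factual question answering, summarization, and helpful/harmless dialogue, TUR-DPO improves judge win-rates, faithfulness, and calibration relative to DPO.
Reference score (independently computed attention score): 22.1407356439052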
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Aligning large language models (LLMs) with human preferences is commonly done via reinforcement learning from human feedback (RLHF) with Proximal Policy Optimization (PPO) or, more simply, via Direct Preference Optimization (DPO). While DPO is stable and RL-free, it treats preferences as flat winner vs. loser signals and is sensitive to noisy or brittle preferences arising from fragile chains of thought. We propose TUR-DPO, a topology- and uncertainty-aware variant of DPO that rewards how answers are derived, not only what they say, by eliciting lightweight reasoning topologies and combining semantic faithfulness, utility, and topology quality into a calibrated uncertainty signal. A small learnable reward is factorized over these signals and incorporated into an uncertainty-weighted DPO objective that remains RL-free and relies only on a fixed or moving reference policy. Empirically, across open 7-8B models and benchmarks spanning mathematical reasoning, factual question answering, summarization, and helpful/harmless dialogue, TUR-DPO improves judge win-rates, faithfulness, and calibration relative to DPO while preserving training simplicity and avoiding online rollouts. We further observe consistent gains in multimodal and long-context settings, and show that TUR-DPO matches or exceeds PPO on reasoning-centric tasks while maintaining operational simplicity.
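Abstract (reference translation): Large language models (LLMs) are commonly aligned with human preferences through reinforcement learning from human feedback (RLHF) with Proximal Policy Optimization (PPO) or, more simply, through Direct Preference Optimization (DPO). DPO is stable and RL-free, but it treats preferences as flat winner-versus-loser signals and is sensitive to noisy or brittle preferences that arise from fragile chains of thought. The authors propose TUR-DPO, a topology- and uncertainty-aware variant of DPO. It rewards how an answer is derived, not only what it says, by eliciting lightweight reasoning topologies and combining semantic faithfulness, utility, and topology quality into a calibrated uncertainty signal. A small learnable reward is factorized over these signals and incorporated into an uncertainty-weighted DPO objective that remains RL-free and depends only on a fixed or moving reference policy. Empirically, across open 7-8B models and benchmarks spanning mathematical reasoning, factual question answering, summarization, and helpful/harmless dialogue, TUR-DPO improves judge win-rates, faithfulness, and calibration relative to DPO while preserving training simplicity and avoiding online rollouts. The authors also observe consistent gains in multimodal and long-context settings, and show that TUR-DPO matches or exceeds PPO on reasoning-centric tasks while maintaining operational simplicity.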
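The abstract does not give the objective itself, so the following is only a minimal sketch of what an uncertainty-weighted DPO loss could look like. The per-pair term is the standard DPO loss. Everything else is an assumption made for illustration: the function names, the `beta` default, the `uncertainty` value in [0, 1], and the weighting rule `1 - uncertainty`. None of it is taken from the paper.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winner, loser) pair of sequence log-probs."""
    # Implicit reward margin: how much more the policy prefers the winner
    # over the loser, measured relative to the reference policy.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def weighted_dpo_loss(pairs, beta=0.1):
    """Confidence-weighted mean of per-pair DPO losses.

    Each pair is (logp_w, logp_l, ref_logp_w, ref_logp_l, uncertainty).
    ASSUMPTION: uncertainty lies in [0, 1] and the weight is 1 - uncertainty.
    This rule is hypothetical, not the paper's formula.
    """
    total = 0.0
    weight_sum = 0.0
    for logp_w, logp_l, ref_w, ref_l, uncertainty in pairs:
        weight = 1.0 - uncertainty
        total += weight * dpo_loss(logp_w, logp_l, ref_w, ref_l, beta)
        weight_sum += weight
    # If every pair is fully uncertain, there is nothing to learn from.
    return total / weight_sum if weight_sum > 0 else 0.0
```

Under this assumed weighting, a pair with `uncertainty = 1.0` contributes nothing to the loss, and a batch in which every pair has `uncertainty = 0.0` reduces to the ordinary mean DPO loss. No sampling from the policy is needed, which is consistent with the abstract's claim that the objective stays RL-free.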

Related papers