Fugu-MT Paper Translation (Summary): UniAPL: A Unified Adversarial Preference Learning Framework for Instruct-Following

Paper summary: UniAPL: A Unified Adversarial Preference Learning Framework for Instruct-Following

arxiv url: http://arxiv.org/abs/2509.25148v1
Date: Mon, 29 Sep 2025 17:53:09 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:20.194772
Title: UniAPL: A Unified Adversarial Preference Learning Framework for Instruct-Following
Title (reference translation): UniAPL: A Unified Adversarial Preference Learning Framework for Instruction Following
Authors: FaQiang Qian, WeiKun Zhang, Ziliang Wang, Kang An, Xuhui Zheng, Liangjian Wen, Mengya Gao, Yong Dai, Yichao Wu
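Abstract summary: We argue that post-training alignment is fundamentally a unified preference learning problem. UniAPL implements a single-stage unified training objective that learns jointly from mixed batches of SFT and preference data.
Reference score (independently calculated attention score): 12.924923059340395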
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Shaping powerful LLMs to be beneficial and safe is central to AI alignment. We argue that post-training alignment is fundamentally a unified Preference Learning problem, involving two modalities: demonstrated preferences (e.g., Supervised Fine-Tuning, SFT) and comparative preferences (e.g., Reinforcement Learning, RL). The standard sequential pipeline (SFT followed by RL) is flawed due to a critical distributional mismatch: SFT uses static expert data, but as the policy evolves, its generation distribution drifts, making SFT knowledge brittle. Subsequent RL then explores without direct access to the rich, ground-truth knowledge in expert demonstrations, leading to inefficient, ungrounded updates. This separation prevents mutual regularization between data sources. To address this, we reframe alignment as a constrained optimization problem and propose Unified Adversarial Preference Learning (UniAPL), a novel framework that dynamically aligns the policy's distribution with the expert's. UniAPL implements a single-stage unified training objective, jointly learning from mixed batches of SFT and preference data. In every gradient step, dense expert demonstrations directly ground and regularize online exploration, inherently resolving distributional mismatch and maximizing data synergy. We evaluate UniAPL on instruction-following tasks using Qwen3-235B-Instruct-2507 as the teacher. Our models match or exceed strong GRPO baselines: +5.77% on Qwen3-0.6B (matching a 32B model) and +3.75% on Qwen3-4B, even outperforming the teacher. Analyses of response length and log-probability distributions confirm that UniAPL outputs closely mimic expert demonstrations, achieving both stronger performance and better behavioral alignment.
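Abstract (reference translation): Shaping powerful LLMs to be beneficial and safe is central to AI alignment. Post-training alignment is fundamentally a unified preference learning problem involving two modalities: demonstrated preferences (e.g., Supervised Fine-Tuning, SFT) and comparative preferences (e.g., Reinforcement Learning, RL). SFT uses static expert data, but as the policy evolves, its generation distribution drifts, making SFT knowledge brittle. RL then explores without direct access to the rich, ground-truth knowledge in expert demonstrations, leading to inefficient, ungrounded updates. This separation prevents mutual regularization between data sources. To address this, we reframe alignment as a constrained optimization problem and propose Unified Adversarial Preference Learning (UniAPL), a novel framework that dynamically aligns the policy's distribution with the expert's. UniAPL implements a single-stage unified training objective that learns jointly from mixed batches of SFT and preference data. We evaluate UniAPL on instruction-following tasks using Qwen3-235B-Instruct-2507 as the teacher. It gains +5.77% on Qwen3-0.6B (matching a 32B model) and +3.75% on Qwen3-4B, even outperforming the teacher. Analyses of response length and log-probability distributions confirm that UniAPL outputs closely resemble expert demonstrations, achieving both stronger performance and better behavioral alignment.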
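
Illustrative sketch (not from the paper): the abstract describes a single-stage objective computed over mixed batches of SFT and preference data but does not give its formula. The Python below is a minimal sketch of what such a mixed-batch loss could look like, under loudly stated assumptions: the SFT term is a plain negative log-likelihood of the expert response, and a pairwise logistic (Bradley-Terry style) loss is swapped in for the paper's RL and adversarial components. The names `SFTExample`, `PreferenceExample`, `unified_loss`, `sft_weight`, `pref_weight`, and `beta` are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Union


@dataclass
class SFTExample:
    # Log-probability the policy assigns to the expert demonstration.
    logp_expert: float


@dataclass
class PreferenceExample:
    # Log-probabilities the policy assigns to the preferred and
    # dispreferred responses for the same prompt.
    logp_chosen: float
    logp_rejected: float


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def unified_loss(
    batch: List[Union[SFTExample, PreferenceExample]],
    sft_weight: float = 1.0,   # assumption: weight on the demonstration term
    pref_weight: float = 1.0,  # assumption: weight on the comparison term
    beta: float = 1.0,         # assumption: scale on the preference margin
) -> float:
    """Average one loss over a batch that mixes both data types."""
    if not batch:
        raise ValueError("batch must not be empty")
    total = 0.0
    for ex in batch:
        if isinstance(ex, SFTExample):
            # Demonstrated preference: maximize likelihood of expert output.
            total += sft_weight * (-ex.logp_expert)
        else:
            # Comparative preference: push chosen above rejected.
            margin = beta * (ex.logp_chosen - ex.logp_rejected)
            total += pref_weight * (-log_sigmoid(margin))
    return total / len(batch)


if __name__ == "__main__":
    mixed = [SFTExample(-2.0), PreferenceExample(-1.0, -3.0)]
    print(unified_loss(mixed))
```

Because both terms are averaged into one scalar, a single gradient step would be influenced by expert demonstrations and by comparisons at the same time, which is the property the abstract attributes to the single-stage objective. The actual UniAPL loss may differ substantially from this sketch.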


Related papers