Fugu-MT Paper Translation (Abstract): Primal-Dual Direct Preference Optimization for Constrained LLM Alignment

Paper abstract: Primal-Dual Direct Preference Optimization for Constrained LLM Alignment

arxiv url: http://arxiv.org/abs/2510.05703v1
Date: Tue, 07 Oct 2025 09:10:35 GMT
Status: Translation complete
Last updated in system: 2025-10-08 17:57:08.174671
Title: Primal-Dual Direct Preference Optimization for Constrained LLM Alignment
Authors: Yihan Du, Seo Taek Kong, R. Srikant
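Abstract summary: We study the problem of constrained alignment in large language models (LLMs). We propose a novel primal-dual DPO approach that first trains a model using standard DPO on reward preference data. We establish rigorous theoretical guarantees on the suboptimality and constraint violation of the output policy.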
Reference score (attention score computed by this site): 16.080857375857697
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The widespread application of Large Language Models (LLMs) imposes increasing demands on safety, such as reducing harmful content and fake information, and avoiding certain forbidden tokens due to rules and laws. While there have been several recent works studying safe alignment of LLMs, these works either require the training of reward and cost models and incur high memory and computational costs, or need prior knowledge about the optimal solution. Motivated by this fact, we study the problem of constrained alignment in LLMs, i.e., maximizing the output reward while restricting the cost due to potentially unsafe content to stay below a threshold. For this problem, we propose a novel primal-dual DPO approach, which first trains a model using standard DPO on reward preference data to provide reward information, and then adopts a rearranged Lagrangian DPO objective utilizing the provided reward information to fine-tune LLMs on cost preference data. Our approach significantly reduces memory and computational costs, and does not require extra prior knowledge. Moreover, we establish rigorous theoretical guarantees on the suboptimality and constraint violation of the output policy. We also extend our approach to an online data setting by incorporating exploration bonuses, which enables our approach to explore uncovered prompt-response space, and then provide theoretical results that get rid of the dependence on preference data coverage. Experimental results on the widely-used preference dataset PKU-SafeRLHF demonstrate the effectiveness of our approach.
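The two ingredients named in the abstract, a standard DPO loss and a primal-dual treatment of a cost constraint, can be illustrated with a minimal sketch. This is not the paper's rearranged Lagrangian DPO objective, which the abstract does not spell out. It only shows the widely used per-pair DPO loss and a generic projected dual-ascent step on a Lagrange multiplier. The function names, the `beta` and `step_size` defaults, and the scalar `estimated_cost` input are all assumptions made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single preference pair.

    Inputs are sequence log-probabilities under the trained policy and
    under the frozen reference policy. `beta` is an assumed default.
    """
    # Implicit reward margin between the chosen and rejected responses.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -log_sigmoid(margin)


def dual_step(lam: float, estimated_cost: float, threshold: float,
              step_size: float = 0.05) -> float:
    """Generic projected gradient ascent on a Lagrange multiplier.

    Hypothetical helper: raises lam when the estimated cost exceeds the
    threshold, lowers it otherwise, and clips at zero.
    """
    return max(0.0, lam + step_size * (estimated_cost - threshold))


if __name__ == "__main__":
    # A policy identical to the reference has zero margin.
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Cost above the threshold pushes the multiplier up.
    print(dual_step(0.0, estimated_cost=1.0, threshold=0.5))
```

In a primal-dual loop of this generic kind, the multiplier would weight the cost term in the primal objective between policy updates. How the paper combines it with the reward information from the first DPO stage is described in the paper itself, not here.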


Related papers