Fugu-MT Paper Translation (Abstract): Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment

Paper overview: Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment

arxiv url: http://arxiv.org/abs/2604.04410v1
Date: Mon, 06 Apr 2026 04:21:24 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:19.089278
Title: Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment
Authors: Hiroshi Takahashi, Tomoharu Iwata, Atsutoshi Kumagai, Sekitoshi Kanai, Masanori Yamada, Kosuke Nishida, Kazutoshi Shinoda
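Abstract summary: Direct density ratio optimization (DDRO) achieves statistical consistency without assuming any human preference model. This paper proposes a novel alignment method that is both stable and statistically consistent.
Reference score (independently computed attention measure): 40.653679055257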
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Aligning language models with human preferences is essential for ensuring their safety and reliability. Although most existing approaches assume specific human preference models such as the Bradley-Terry model, this assumption may fail to accurately capture true human preferences, and consequently, these methods lack statistical consistency, i.e., the guarantee that language models converge to the true human preference as the number of samples increases. In contrast, direct density ratio optimization (DDRO) achieves statistical consistency without assuming any human preference models. DDRO models the density ratio between preferred and non-preferred data distributions using the language model, and then optimizes it via density ratio estimation. However, this density ratio is unstable and often diverges, leading to training instability of DDRO. In this paper, we propose a novel alignment method that is both stable and statistically consistent. Our approach is based on the relative density ratio between the preferred data distribution and a mixture of the preferred and non-preferred data distributions. Our approach is stable since this relative density ratio is bounded above and does not diverge. Moreover, it is statistically consistent and yields significantly tighter convergence guarantees than DDRO. We experimentally show its effectiveness with Qwen 2.5 and Llama 3.
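The boundedness claim in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: it assumes the standard relative density ratio form p / (alpha * p + (1 - alpha) * q) with a hypothetical mixing weight `ALPHA` (the abstract does not state the exact form or weight), and it uses two 1D Gaussians as stand-ins for the preferred and non-preferred data distributions.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1D Gaussian, used here as a toy data distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def density_ratio(p, q):
    """Plain density ratio p/q; unbounded wherever q is much smaller than p."""
    return p / q

def relative_density_ratio(p, q, alpha):
    """Ratio of p to the mixture alpha*p + (1-alpha)*q.

    The denominator is at least alpha*p, so the ratio is at most 1/alpha.
    """
    return p / (alpha * p + (1.0 - alpha) * q)

ALPHA = 0.5  # hypothetical mixing weight, chosen only for illustration

rows = []
for x in [0.0, 2.0, 4.0, 8.0]:
    p = gaussian_pdf(x, mean=1.0, std=1.0)   # toy "preferred" density
    q = gaussian_pdf(x, mean=-1.0, std=1.0)  # toy "non-preferred" density
    rows.append((x, density_ratio(p, q), relative_density_ratio(p, q, ALPHA)))

for x, plain, relative in rows:
    print(f"x={x:4.1f}  p/q={plain:12.4e}  relative={relative:.6f}")
```

In this toy setting the plain ratio grows exponentially as x moves into the tail of the non-preferred density, while the relative ratio never exceeds 1/ALPHA.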

Related papers