Fugu-MT Paper Translation (Abstract): Intelligently Weighting Multiple Reference Models for Direct Preference Optimization of LLMs

Paper summary: Intelligently Weighting Multiple Reference Models for Direct Preference Optimization of LLMs

arxiv url: http://arxiv.org/abs/2512.10040v1
Date: Wed, 10 Dec 2025 19:45:20 GMT
Status: Translation complete
Last updated in system: 2025-12-12 16:15:42.034115
Title: Intelligently Weighting Multiple Reference Models for Direct Preference Optimization of LLMs
Title (reference translation): Intelligent Weighting of Multiple Reference Models for Direct Preference Optimization of LLMs
Authors: Skyler Wu, Aymen Echarghaoui
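Abstract summary: Current methods for setting the reference weights in Multiple-Reference Preference Optimization (MRPO), which builds on Direct Preference Optimization (DPO), are ad hoc and statistically unsound, leading to unreliable performance. The proposed strategies include two offline methods that use a held-out validation signal and an online method that uses a sliding-window estimator to reduce overfitting. Experiments using Qwen2.5-0.5B as the policy model and seven reference models from the Llama, Mistral, Qwen, Yi, and Phi families (0.5B-14B each) show that all four of the proposed strategies outperform the current MRPO weighting methods.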
Reference score (independently computed attention score): 2.0411082897313984
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Fine-tuning is integral for aligning large language models (LLMs) with human preferences. Multiple-Reference Preference Optimization (MRPO) builds on Direct Preference Optimization (DPO) by fine-tuning LLMs on preference datasets while regularizing the policy towards a mixture of reference models to leverage their collective desirable properties. However, current methods for setting the reference weights are ad-hoc and statistically unsound, leading to unreliable performance. To address this, we introduce four new weighting strategies: two offline methods that leverage held-out validation signal; one online method that uses a sliding-window estimator to reduce overfitting; and an online method that treats reference weighting as a $K$-armed bandit via Thompson Sampling. Experiments using Qwen2.5-0.5B as the policy model and seven reference models from the Llama, Mistral, Qwen, Yi, and Phi families (0.5B-14B each) show that all 4 of our strategies outperform the current MRPO weighting methods on UltraFeedback and SafeRLHF in preference accuracy. More thought-provokingly, however, we find that single-reference DPO, using any of 6 out of 7 references, consistently outperforms all tested multiple-reference approaches -- calling into question the practical appeal of multiple-reference approaches.
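Abstract (reference translation): Fine-tuning is essential for aligning large language models (LLMs) with human preferences. Multiple-Reference Preference Optimization (MRPO) builds on Direct Preference Optimization (DPO): it fine-tunes LLMs on preference datasets while regularizing the policy toward a mixture of reference models in order to leverage their collective desirable properties. However, current methods for setting the reference weights are ad hoc and statistically unsound, which leads to unreliable performance. To address this, we introduce four new weighting strategies: two offline methods that use a held-out validation signal, one online method that uses a sliding-window estimator to reduce overfitting, and one online method that treats reference weighting as a $K$-armed bandit via Thompson Sampling. Experiments with Qwen2.5-0.5B as the policy model and seven reference models from the Llama, Mistral, Qwen, Yi, and Phi families (0.5B-14B each) show that all four strategies outperform the current MRPO weighting methods on UltraFeedback and SafeRLHF in preference accuracy. More thought-provoking, however, is that single-reference DPO, using any of 6 of the 7 references, consistently outperforms all tested multiple-reference approaches, which calls into question the practical appeal of multiple-reference approaches.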
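
For readers unfamiliar with the bandit framing mentioned in the abstract, the following is a minimal sketch of generic Beta-Bernoulli Thompson Sampling over $K$ arms, where each arm stands for one reference model. It is not the authors' implementation: the abstract does not say how rewards are defined or how sampled arms become reference weights, so the binary reward, the `true_rates` simulation, and the function names `thompson_select` and `run_bandit` are all assumptions made for illustration.

```python
import random


def thompson_select(successes, failures, rng):
    """Draw one sample from each arm's Beta posterior and pick the largest."""
    # Beta(successes + 1, failures + 1) is the posterior under a uniform prior.
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)


def run_bandit(true_rates, steps, seed=0):
    """Simulate a K-armed Bernoulli bandit (hypothetical reward model).

    true_rates[i] is an assumed success probability for arm i; in the
    paper's setting an arm would correspond to one reference model.
    """
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    for _ in range(steps):
        arm = thompson_select(successes, failures, rng)
        # Simulated binary reward; the real reward signal is not given
        # in the abstract, so this is only a stand-in.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


if __name__ == "__main__":
    succ, fail = run_bandit([0.1, 0.9, 0.2], steps=2000)
    pulls = [s + f for s, f in zip(succ, fail)]
    print("pulls per arm:", pulls)
```

In this sketch the arm with the highest assumed success rate ends up being pulled most often, which is the behavior one would want if arms with better rewards should receive more weight.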

Related papers