Fugu-MT Paper Translation (Summary): Semi-Supervised Preference Optimization with Limited Feedback

Paper summary: Semi-Supervised Preference Optimization with Limited Feedback

arxiv url: http://arxiv.org/abs/2511.00040v1
Date: Tue, 28 Oct 2025 01:33:43 GMT
Status: Translation complete
System update date: 2025-11-05 16:37:26.531601
Title: Semi-Supervised Preference Optimization with Limited Feedback
Authors: Seonggyun Lee, Sungjun Lim, Seojin Park, Soeun Cheon, Kyungwoo Song
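Abstract summary: This paper studies the problem of Semi-Supervised Preference Optimization (SSPO), which aims to learn simultaneously from a small number of pairwise preference labels and a large pool of unpaired samples. The key theoretical contribution proves the existence of an optimal reward threshold capable of separating winning and losing responses with high probability, which enables principled pseudo-labeling of unpaired data. By leveraging these pseudo-labels, SSPO effectively distills latent preferences from large-scale unpaired data, maintaining human alignment while drastically reducing acquisition costs.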
Reference score (independently computed attention metric): 17.112054023380647
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The field of preference optimization has made outstanding contributions to the alignment of language models with human preferences. Despite these advancements, recent methods still rely heavily on substantial paired (labeled) feedback data, leading to substantial resource expenditures. To address these challenges, we study the problem of Semi-Supervised Preference Optimization (SSPO) in which the idea is to learn from both a small number of pairwise preference labels and a large pool of unpaired samples simultaneously. Our key theoretical contribution proves the existence of an optimal reward threshold capable of separating winning and losing responses with high probability, which enables a principled pseudo-labeling of unpaired data. By leveraging these pseudo-labels, SSPO effectively distills latent preferences from large-scale unpaired data, thus maintaining human alignment while drastically reducing acquisition costs. Extensive experiments across datasets validate this remarkable data efficiency; for instance, SSPO trained with Llama3-8B-Instruct on just 1% of UltraFeedback consistently surpasses strong baselines trained on 10% of UltraFeedback.
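The abstract's idea of a reward threshold that separates winning from losing responses can be illustrated with a toy sketch. This is not the SSPO algorithm from the paper: the helper names `best_threshold` and `pseudo_label`, the scalar reward inputs, and the rule of picking the threshold with the fewest errors on the labeled rewards are all assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def best_threshold(win_rewards: Sequence[float],
                   lose_rewards: Sequence[float]) -> float:
    """Return the candidate threshold with the fewest errors on labeled rewards.

    Hypothetical helper: a reward >= threshold is read as "win",
    anything below it as "lose".
    """
    values = sorted(set(win_rewards) | set(lose_rewards))
    # Candidates are midpoints between neighbouring distinct reward values;
    # if there is only one distinct value, fall back to that value.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or values

    def errors(t: float) -> int:
        # Winners that fall below t plus losers that reach t.
        return sum(r < t for r in win_rewards) + sum(r >= t for r in lose_rewards)

    return min(candidates, key=errors)


def pseudo_label(rewards: Sequence[float],
                 threshold: float) -> List[Tuple[float, str]]:
    """Attach a pseudo-label to each unpaired reward using the threshold."""
    return [(r, "win" if r >= threshold else "lose") for r in rewards]


# Small labeled set (assumed rewards) and a pool of unpaired rewards.
threshold = best_threshold(win_rewards=[2.0, 3.0, 4.0], lose_rewards=[0.0, 1.0])
labels = pseudo_label([0.5, 2.5], threshold)
```

In this toy setting the labeled rewards are perfectly separable, so the chosen threshold makes no errors on them. The paper's result concerns separation with high probability, which this sketch does not attempt to model.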
