Fugu-MT Paper Translation (Abstract): Swap-guided Preference Learning for Personalized Reinforcement Learning from Human Feedback

Paper Abstract: Swap-guided Preference Learning for Personalized Reinforcement Learning from Human Feedback

arxiv url: http://arxiv.org/abs/2603.12595v1
Date: Fri, 13 Mar 2026 02:51:50 GMT
Status: Translation complete
Last updated in system: 2026-03-16 17:38:11.862106
Title: Swap-guided Preference Learning for Personalized Reinforcement Learning from Human Feedback
Title (reference translation): Swap-guided preference learning for personalized reinforcement learning from human feedback
Authors: Gihoon Kim, Euntai Kim
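Abstract summary: Variational Preference Learning (VPL) seeks to address RLHF's single-reward assumption by introducing user-specific latent variables. Under sparse preference data, VPL may ignore the latent variables and revert to a single-reward model. To overcome this limitation, the authors propose Swap-guided Preference Learning (SPL).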
Reference score (independently computed attention score): 16.26441026659651
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning from Human Feedback (RLHF) is a widely used approach to align large-scale AI systems with human values. However, RLHF typically assumes a single, universal reward, which overlooks diverse preferences and limits personalization. Variational Preference Learning (VPL) seeks to address this by introducing user-specific latent variables. Despite its promise, we found that VPL suffers from posterior collapse. While this phenomenon is well known in VAEs, it has not previously been identified in preference learning frameworks. Under sparse preference data and with overly expressive decoders, VPL may cause latent variables to be ignored, reverting to a single-reward model. To overcome this limitation, we propose Swap-guided Preference Learning (SPL). The key idea is to construct fictitious swap annotators and use the mirroring property of their preferences to guide the encoder. SPL introduces three components: (1) swap-guided base regularization, (2) Preferential Inverse Autoregressive Flow (P-IAF), and (3) adaptive latent conditioning. Experiments show that SPL mitigates collapse, enriches user-specific latents, and improves preference prediction. Our code and data are available at https://github.com/cobang0111/SPL
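Abstract (reference translation): Reinforcement Learning from Human Feedback (RLHF) is a widely used approach for aligning large-scale AI systems with human values. However, RLHF generally assumes a single, universal reward, which overlooks diverse preferences and limits personalization. Variational Preference Learning (VPL) seeks to address this problem by introducing user-specific latent variables. Despite its promise, VPL was found to suffer from posterior collapse. This phenomenon is well known in VAEs, but it had not previously been identified in preference learning frameworks. With sparse preference data and overly expressive decoders, VPL may ignore the latent variables and revert to a single-reward model. To overcome this limitation, the authors propose Swap-guided Preference Learning (SPL). The key idea is to construct fictitious swap annotators and use the mirroring property of their preferences to guide the encoder. SPL introduces three components: (1) swap-guided base regularization, (2) Preferential Inverse Autoregressive Flow (P-IAF), and (3) adaptive latent conditioning. Experiments show that SPL mitigates collapse, enriches user-specific latents, and improves preference prediction. The code and data are available at https://github.com/cobang0111/SPL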
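
The "fictitious swap annotator" idea in the abstract can be pictured with a small sketch. The abstract does not give the construction, so everything here is an assumption made for illustration: preferences are modeled with a Bradley-Terry likelihood, the swap annotator is built by exchanging the chosen and rejected items in each pair, and its reward is taken to be the negation of the original user's reward. The names `preference_prob`, `swap_annotator`, `user_prefs`, and `rewards` are hypothetical and are not taken from the authors' repository.

```python
import math


def preference_prob(r_first: float, r_second: float) -> float:
    """Probability that the first item is preferred over the second,
    under a Bradley-Terry model (an assumption, not stated in the abstract)."""
    return 1.0 / (1.0 + math.exp(-(r_first - r_second)))


def swap_annotator(prefs):
    """Build a fictitious annotator by exchanging chosen and rejected
    in every (chosen, rejected) pair."""
    return [(rejected, chosen) for (chosen, rejected) in prefs]


# Hypothetical toy data: (chosen, rejected) item ids for one user.
user_prefs = [("a", "b"), ("c", "d")]
swapped_prefs = swap_annotator(user_prefs)

# Hypothetical per-item rewards for the original user.
rewards = {"a": 1.2, "b": -0.3, "c": 0.5, "d": 0.1}

mirror_gaps = []
for chosen, rejected in user_prefs:
    # Original user: probability of preferring `chosen` over `rejected`.
    p_user = preference_prob(rewards[chosen], rewards[rejected])
    # Swap annotator (assumed to hold the negated reward) on the same pair.
    p_swap = preference_prob(-rewards[chosen], -rewards[rejected])
    # Mirroring: the two probabilities sum to one.
    mirror_gaps.append(abs(p_user + p_swap - 1.0))
```

Under these assumptions the swap annotator's preferences are an exact mirror image of the user's, which is the kind of known relationship that could serve as a training signal for an encoder. How SPL actually uses it is described in the paper, not here.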

Related Papers