Fugu-MT Paper Translation (Abstract): Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret

Paper abstract: Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret

arxiv url: http://arxiv.org/abs/2603.20453v1
Date: Fri, 20 Mar 2026 19:34:53 GMT
Status: Translation complete
System update date: 2026-03-24 19:11:38.923564
Title: Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret
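Title (reference translation): Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret
Authors: Ming Shi, Yingbin Liang, Ness B. Shroff, Ananthram Swami
Abstract summary: We study episodic RL from multi-source imperfect preferences through a cumulative imperfection budget. We propose a unified algorithm with regret $\tilde{O}(\sqrt{K/M}+\omega)$ that exhibits best-of-both-regimes behavior.
Reference score (independently computed attention score): 71.69884486156359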
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth objective. In practical RLHF systems, however, feedback is typically \emph{multi-source} (annotators, experts, reward models, heuristics) and can exhibit systematic, persistent mismatches due to subjectivity, expertise variation, and annotation/modeling artifacts. We study episodic RL from \emph{multi-source imperfect preferences} through a cumulative imperfection budget: for each source, the total deviation of its preference probabilities from an ideal oracle is at most $\omega$ over $K$ episodes. We propose a unified algorithm with regret $\tilde{O}(\sqrt{K/M}+\omega)$, which exhibits a best-of-both-regimes behavior: it achieves $M$-dependent statistical gains when imperfection is small (where $M$ is the number of sources), while remaining robust with unavoidable additive dependence on $\omega$ when imperfection is large. We complement this with a lower bound $\tilde{\Omega}(\max\{\sqrt{K/M},\omega\})$, which captures the best possible improvement with respect to $M$ and the unavoidable dependence on $\omega$, and a counterexample showing that naïvely treating imperfect feedback as oracle-consistent can incur regret as large as $\tilde{\Omega}(\min\{\omega\sqrt{K},K\})$. Technically, our approach involves imperfection-adaptive weighted comparison learning, value-targeted transition estimation to control hidden feedback-induced distribution shift, and sub-importance sampling to keep the weighted objectives analyzable, yielding regret guarantees that quantify when multi-source feedback provably improves RLHF and how cumulative imperfection fundamentally limits it.
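To make the three rates in the abstract concrete, here is a minimal Python sketch that evaluates them for a few settings of $K$, $M$, and $\omega$. It is an illustration only: constants and logarithmic factors hidden by the tilde notation are dropped, so the outputs are growth rates and not regret values, and the function names (`upper_rate`, `lower_rate`, `naive_rate`, `dominant_term`) are hypothetical and do not come from the paper.

```python
import math


def upper_rate(K: int, M: int, omega: float) -> float:
    """Rate of the stated upper bound, sqrt(K/M) + omega (constants and logs dropped)."""
    return math.sqrt(K / M) + omega


def lower_rate(K: int, M: int, omega: float) -> float:
    """Rate of the stated lower bound, max{sqrt(K/M), omega} (constants and logs dropped)."""
    return max(math.sqrt(K / M), omega)


def naive_rate(K: int, omega: float) -> float:
    """Rate from the counterexample for oracle-consistent treatment, min{omega*sqrt(K), K}."""
    return min(omega * math.sqrt(K), K)


def dominant_term(K: int, M: int, omega: float) -> str:
    """Report which term of the upper bound is larger for these parameters."""
    return "statistical" if math.sqrt(K / M) >= omega else "imperfection"


if __name__ == "__main__":
    K = 10_000  # number of episodes (illustrative value, not from the paper)
    for M, omega in [(1, 10), (4, 10), (100, 10), (4, 500)]:
        print(
            f"M={M:>3} omega={omega:>3} "
            f"upper={upper_rate(K, M, omega):7.1f} "
            f"lower={lower_rate(K, M, omega):7.1f} "
            f"naive={naive_rate(K, omega):8.1f} "
            f"dominant={dominant_term(K, M, omega)}"
        )
```

Because $a + b \le 2\max\{a, b\}$ for nonnegative $a$ and $b$, the upper and lower rates computed here always agree within a factor of 2, which is the sense in which the two bounds in the abstract match up to constants and logarithmic factors.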
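Abstract (reference translation): Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, but regret-oriented theory often assumes that preference labels are generated consistently from a single objective. In practical RLHF systems, however, feedback is typically \emph{multi-source} (annotators, experts, reward models, heuristics) and can exhibit systematic, persistent mismatches due to subjectivity, variation in expertise, and annotation/modeling artifacts. We study episodic RL from \emph{multi-source imperfect preferences} through a cumulative imperfection budget. We propose a unified algorithm with regret $\tilde{O}(\sqrt{K/M}+\omega)$ that exhibits best-of-both-regimes behavior: it achieves $M$-dependent statistical gains when imperfection is small (where $M$ is the number of sources) while retaining an unavoidable additive dependence on $\omega$ when imperfection is large. We complement this with a lower bound $\tilde{\Omega}(\max\{\sqrt{K/M},\omega\})$, which captures the best possible improvement with respect to $M$ and the unavoidable dependence on $\omega$. Technically, our approach involves imperfection-adaptive weighted comparison learning, value-targeted transition estimation to control hidden feedback-induced distribution shift, and sub-importance sampling to keep the weighted objectives analyzable, yielding regret guarantees that quantify when multi-source feedback provably improves RLHF and how cumulative imperfection fundamentally limits it.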

Related papers