Fugu-MT Paper Translation (Abstract): Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options

Paper overview: Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options

arxiv url: http://arxiv.org/abs/2510.18713v1
Date: Tue, 21 Oct 2025 15:11:01 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:13.751205
Title: Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options
Title (reference translation): Preference-Based Reinforcement Learning beyond Pairwise Comparisons: The Benefits of Multiple Options
Authors: Joongkyu Lee, Seouh-won Yi, Min-hwan Oh
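Abstract summary: We study online preference-based reinforcement learning (PbRL) with the goal of improving sample efficiency. We propose M-AUPO, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. We prove that M-AUPO achieves a suboptimality gap of $\tilde{\mathcal{O}}\left( \frac{d}{T} \sqrt{ \sum_{t=1}^T \frac{1}{|S_t|}} \right)$.
Reference score (independently computed attention score): 35.41703011973504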
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We study online preference-based reinforcement learning (PbRL) with the goal of improving sample efficiency. A growing body of theoretical work has emerged, motivated by PbRL's recent empirical success, particularly in aligning large language models (LLMs), but most existing studies focus only on pairwise comparisons. A few recent works (Zhu et al., 2023; Mukherjee et al., 2024; Thekumparampil et al., 2024) have explored using multiple comparisons and ranking feedback, but their performance guarantees fail to improve, and can even deteriorate, as the feedback length increases, despite the richer information available. To address this gap, we adopt the Plackett-Luce (PL) model for ranking feedback over action subsets and propose M-AUPO, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. We prove that M-AUPO achieves a suboptimality gap of $\tilde{\mathcal{O}}\left( \frac{d}{T} \sqrt{ \sum_{t=1}^T \frac{1}{|S_t|}} \right)$, where $T$ is the total number of rounds, $d$ is the feature dimension, and $|S_t|$ is the size of the subset at round $t$. This result shows that larger subsets directly lead to improved performance and, notably, the bound avoids the exponential dependence on the unknown parameter's norm, which was a fundamental limitation in most previous works. Moreover, we establish a near-matching lower bound of $\Omega \left( \frac{d}{K \sqrt{T}} \right)$, where $K$ is the maximum subset size. To the best of our knowledge, this is the first theoretical result in PbRL with ranking feedback that explicitly shows improved sample efficiency as a function of the subset size.
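Abstract (reference translation): We study online preference-based reinforcement learning (PbRL) with the goal of improving sample efficiency. Although theoretical work has been growing, driven by PbRL's recent empirical success, particularly in aligning large language models (LLMs), existing studies focus only on pairwise comparisons. Several recent works (Zhu et al., 2023; Mukherjee et al., 2024; Thekumparampil et al., 2024) have explored multiple comparisons and ranking feedback, but their performance guarantees do not improve, and can even deteriorate, as the feedback length increases, despite the richer information available. To address this gap, we adopt the Plackett-Luce (PL) model for ranking feedback over action subsets and propose M-AUPO, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. We show that M-AUPO achieves a suboptimality gap of $\tilde{\mathcal{O}}\left( \frac{d}{T} \sqrt{ \sum_{t=1}^T \frac{1}{|S_t|}} \right)$. This result shows that larger subsets directly improve performance and, notably, that the bound avoids exponential dependence on the norm of the unknown parameter. We further establish a near-matching lower bound of $\Omega \left( \frac{d}{K \sqrt{T}} \right)$, where $K$ is the maximum subset size. To the best of our knowledge, this is the first theoretical result in PbRL with ranking feedback that explicitly shows improved sample efficiency as a function of the subset size.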
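
To make the two quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It shows (1) the log-probability of a full ranking under the Plackett-Luce model and (2) the value of the upper-bound expression $\frac{d}{T} \sqrt{\sum_{t=1}^T \frac{1}{|S_t|}}$ with constants and logarithmic factors dropped. The function names `pl_ranking_log_prob` and `upper_bound_rate` are hypothetical, as is the assumption that each action in the offered subset has a scalar latent utility; the abstract does not specify how utilities are parameterized.

```python
import math
from typing import Sequence


def pl_ranking_log_prob(utilities: Sequence[float], ranking: Sequence[int]) -> float:
    """Log-probability of `ranking` (best first) under the Plackett-Luce model.

    utilities[i] is an assumed scalar latent utility of action i in the
    offered subset; `ranking` is a permutation of range(len(utilities)).
    """
    remaining = list(ranking)
    log_prob = 0.0
    while remaining:
        # Stable log-sum-exp over the actions that have not been ranked yet.
        m = max(utilities[i] for i in remaining)
        log_norm = m + math.log(sum(math.exp(utilities[i] - m) for i in remaining))
        # The next-ranked action is chosen by a softmax over the remaining ones.
        top = remaining.pop(0)
        log_prob += utilities[top] - log_norm
    return log_prob


def upper_bound_rate(d: int, subset_sizes: Sequence[int]) -> float:
    """Evaluate (d / T) * sqrt(sum_t 1/|S_t|), ignoring constants and log factors."""
    T = len(subset_sizes)
    return (d / T) * math.sqrt(sum(1.0 / s for s in subset_sizes))


if __name__ == "__main__":
    # With equal utilities, each of the 3! rankings has probability 1/6.
    print(math.exp(pl_ranking_log_prob([0.0, 0.0, 0.0], [2, 0, 1])))
    # Same d and T, different fixed subset sizes.
    for k in (2, 8):
        print(k, upper_bound_rate(10, [k] * 100))
```

When every round offers a subset of the same size $K$, the expression reduces to $d / \sqrt{TK}$, so in this sketch moving from $K = 2$ to $K = 8$ halves the value. This illustrates only the rate stated in the abstract, not the hidden constants.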


Related papers