Fugu-MT Paper Translation (Abstract): Online Learning and Equilibrium Computation with Ranking Feedback

Paper overview: Online Learning and Equilibrium Computation with Ranking Feedback

arxiv url: http://arxiv.org/abs/2603.19221v1
Date: Thu, 19 Mar 2026 17:59:07 GMT
Status: Translation complete
Last updated in system: 2026-03-20 17:19:06.327683
Title: Online Learning and Equilibrium Computation with Ranking Feedback
Title (reference translation): Online Learning and Equilibrium Computation Using Ranking Feedback
Authors: Mingyang Liu, Yongshan Chen, Zhiyuan Fan, Gabriele Farina, Asuman Ozdaglar, Kaiqing Zhang
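Abstract summary: This work studies an online learning model in which the learner observes only a ranking over a set of proposed actions at each timestep. It shows that sublinear regret is impossible in general with instantaneous-utility ranking feedback. The authors develop algorithms that achieve sublinear regret under the additional assumption that the utility sequence has sublinear total variation.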
Reference score (independently computed attention score): 47.07396244650246
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Online learning in arbitrary, and possibly adversarial, environments has been extensively studied in sequential decision-making, and it is closely connected to equilibrium computation in game theory. Most existing online learning algorithms rely on \emph{numeric} utility feedback from the environment, which may be unavailable in human-in-the-loop applications and/or may be restricted by privacy concerns. In this paper, we study an online learning model in which the learner only observes a \emph{ranking} over a set of proposed actions at each timestep. We consider two ranking mechanisms: rankings induced by the \emph{instantaneous} utility at the current timestep, and rankings induced by the \emph{time-average} utility up to the current timestep, under both \emph{full-information} and \emph{bandit} feedback settings. Using the standard external-regret metric, we show that sublinear regret is impossible with instantaneous-utility ranking feedback in general. Moreover, when the ranking model is relatively deterministic, \emph{i.e.}, under the Plackett-Luce model with a temperature that is sufficiently small, sublinear regret is also impossible with time-average utility ranking feedback. We then develop new algorithms that achieve sublinear regret under the additional assumption that the utility sequence has sublinear total variation. Notably, for full-information time-average utility ranking feedback, this additional assumption can be removed. As a consequence, when all players in a normal-form game follow our algorithms, repeated play yields an approximate coarse correlated equilibrium. We also demonstrate the effectiveness of our algorithms in an online large-language-model routing task.
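Abstract (reference translation): Online learning in arbitrary, and possibly adversarial, environments has been studied extensively in sequential decision-making and is closely connected to equilibrium computation in game theory. Most existing online learning algorithms rely on \emph{numeric} utility feedback from the environment, which may be unavailable in human-in-the-loop applications or restricted by privacy concerns. This paper studies an online learning model in which the learner observes only a \emph{ranking} over a set of proposed actions at each timestep. Two ranking mechanisms are considered: rankings induced by the \emph{instantaneous} utility at the current timestep, and rankings induced by the \emph{time-average} utility up to the current timestep, each under both \emph{full-information} and \emph{bandit} feedback settings. Using the standard external-regret metric, the paper shows that sublinear regret is impossible in general with instantaneous-utility ranking feedback. Moreover, when the ranking model is relatively deterministic, \emph{i.e.}, under the Plackett-Luce model with a sufficiently small temperature, sublinear regret is also impossible with time-average utility ranking feedback. The authors then develop new algorithms that achieve sublinear regret under the additional assumption that the utility sequence has sublinear total variation. Notably, for full-information time-average utility ranking feedback, this additional assumption can be removed. As a consequence, when all players in a normal-form game follow these algorithms, repeated play yields an approximate coarse correlated equilibrium. The effectiveness of the algorithms is also demonstrated on an online large-language-model routing task.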
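To make two notions from the abstract concrete, the following is a minimal Python sketch of (a) sampling a ranking from a Plackett-Luce model with a temperature and (b) computing external regret against the best fixed action in hindsight. It is an illustration only and not the authors' algorithm. The function names `plackett_luce_ranking` and `external_regret`, the parameterization of choice probabilities as proportional to exp(utility / temperature), and the toy utility values are all assumptions made for this example.

```python
import math
import random


def plackett_luce_ranking(utilities, temperature, rng):
    """Sample a ranking (best first) of action indices.

    Assumed parameterization: each position is filled by drawing one of the
    remaining actions with probability proportional to
    exp(utility / temperature). A small temperature makes the ranking
    nearly deterministic.
    """
    remaining = list(range(len(utilities)))
    ranking = []
    while remaining:
        # Shift by the current maximum before exponentiating so that the
        # largest weight is exactly 1 and nothing overflows.
        top = max(utilities[a] for a in remaining)
        weights = [math.exp((utilities[a] - top) / temperature) for a in remaining]
        chosen = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(chosen)
        remaining.remove(chosen)
    return ranking


def external_regret(utility_rows, played):
    """Cumulative utility of the best fixed action minus the learner's."""
    num_actions = len(utility_rows[0])
    best_fixed = max(sum(row[a] for row in utility_rows) for a in range(num_actions))
    earned = sum(row[a] for row, a in zip(utility_rows, played))
    return best_fixed - earned


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical utilities: one row per timestep, one column per action.
    utility_rows = [[0.2, 0.9, 0.5], [0.8, 0.1, 0.5], [0.3, 0.7, 0.5]]
    # Time-average utility of each action up to the last timestep.
    time_average = [sum(col) / len(utility_rows) for col in zip(*utility_rows)]
    print("ranking from instantaneous utility:",
          plackett_luce_ranking(utility_rows[-1], 0.5, rng))
    print("ranking from time-average utility:",
          plackett_luce_ranking(time_average, 0.5, rng))
    print("external regret of always playing action 0:",
          external_regret(utility_rows, [0, 0, 0]))
```

In this sketch the learner would see only the returned ranking, never the numeric utilities; the regret computation uses the utilities purely for evaluation after the fact.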

Related papers