Fugu-MT Paper Translation (Abstract): Optimal and Adaptive Non-Stationary Dueling Bandits Under a Generalized Borda Criterion

Paper summary: Optimal and Adaptive Non-Stationary Dueling Bandits Under a Generalized Borda Criterion

arxiv url: http://arxiv.org/abs/2403.12950v1
Date: Tue, 19 Mar 2024 17:50:55 GMT
Status: Translation complete
System update date: 2024-03-20 13:04:26.687510
Title: Optimal and Adaptive Non-Stationary Dueling Bandits Under a Generalized Borda Criterion
Title (reference translation): Optimal and Adaptive Non-Stationary Dueling Bandits Under a Generalized Borda Criterion
Authors: Joe Suk, Arpit Agarwal
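Abstract summary: In dueling bandits, the learner receives preference feedback between arms, and the regret of an arm is defined in terms of its suboptimality relative to a winner arm. This work establishes the first optimal and adaptive Borda dynamic regret upper bound. Surprisingly, the techniques developed for non-stationary Borda dueling bandits also yield improved rates within the Condorcet winner setting.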
Reference score (independently computed attention measure): 11.770902693413401
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In dueling bandits, the learner receives preference feedback between arms, and the regret of an arm is defined in terms of its suboptimality to a winner arm. The more challenging and practically motivated non-stationary variant of dueling bandits, where preferences change over time, has been the focus of several recent works (Saha and Gupta, 2022; Buening and Saha, 2023; Suk and Agarwal, 2023). The goal is to design algorithms without foreknowledge of the amount of change. The bulk of known results here studies the Condorcet winner setting, where an arm preferred over any other exists at all times. Yet, such a winner may not exist and, to contrast, the Borda version of this problem (which is always well-defined) has received little attention. In this work, we establish the first optimal and adaptive Borda dynamic regret upper bound, which highlights fundamental differences in the learnability of severe non-stationarity between Condorcet vs. Borda regret objectives in dueling bandits. Surprisingly, our techniques for non-stationary Borda dueling bandits also yield improved rates within the Condorcet winner setting, and reveal new preference models where tighter notions of non-stationarity are adaptively learnable. This is accomplished through a novel generalized Borda score framework which unites the Borda and Condorcet problems, thus allowing reduction of Condorcet regret to a Borda-like task. Such a generalization was not previously known and is likely to be of independent interest.
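Abstract (reference translation): In dueling bandits, the learner receives preference feedback between arms, and the regret of an arm is defined by its suboptimality relative to a winner arm. The more challenging and practically motivated non-stationary variant of dueling bandits, in which preferences change over time, has been the focus of several recent works (Saha and Gupta, 2022; Buening and Saha, 2023; Suk and Agarwal, 2023). The goal is to design algorithms without knowing the amount of change in advance. Most known results here study the Condorcet winner setting, in which an arm preferred over every other arm exists at all times. However, such a winner may not exist, and by contrast the Borda version of this problem (which is always well-defined) has received little attention. This work establishes the first optimal and adaptive Borda dynamic regret upper bound, which highlights fundamental differences in the learnability of severe non-stationarity between the Condorcet and Borda regret objectives in dueling bandits. Surprisingly, the techniques for non-stationary Borda dueling bandits also yield improved rates in the Condorcet winner setting and reveal new preference models in which tighter notions of non-stationarity are adaptively learnable. This is achieved through a novel generalized Borda score framework that unifies the Borda and Condorcet problems, allowing Condorcet regret to be reduced to a Borda-like task. Such a generalization was not previously known and is likely to be of independent interest.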
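
The abstract contrasts a Condorcet winner (which may not exist) with a Borda winner (which is always well-defined). The sketch below illustrates that contrast on a single, fixed preference matrix. It is not the paper's algorithm or its generalized Borda score: the function names, the example matrix, and the Borda-score convention used here (the average probability of beating a uniformly random *other* arm) are all assumptions made for illustration.

```python
def borda_scores(P):
    """Borda score of each arm, assuming the common convention:
    the average probability of beating a uniformly random other arm.
    P[i][j] is the probability that arm i is preferred over arm j."""
    K = len(P)
    return [sum(P[i][j] for j in range(K) if j != i) / (K - 1) for i in range(K)]


def borda_winner(P):
    """Index of an arm with the highest Borda score (always exists)."""
    scores = borda_scores(P)
    return max(range(len(P)), key=lambda i: scores[i])


def condorcet_winner(P):
    """Index of the arm preferred over every other arm, or None if no such arm exists."""
    K = len(P)
    for i in range(K):
        if all(P[i][j] > 0.5 for j in range(K) if j != i):
            return i
    return None


# Hypothetical cyclic preferences: arm 0 beats arm 1, arm 1 beats arm 2, arm 2 beats arm 0.
P_cyclic = [
    [0.5, 0.6, 0.2],
    [0.4, 0.5, 0.9],
    [0.8, 0.1, 0.5],
]

print(condorcet_winner(P_cyclic))  # no arm beats all others, so this is None
print(borda_winner(P_cyclic))      # a Borda winner still exists
```

In this example the preferences form a cycle, so no Condorcet winner exists, while the Borda scores still single out one arm. In the non-stationary setting the abstract describes, the preference matrix would additionally change over time.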


Related papers