Fugu-MT 論文翻訳(概要): UCB-based Algorithms for Multinomial Logistic Regression Bandits

論文の概要: UCB-based Algorithms for Multinomial Logistic Regression Bandits

arxiv url: http://arxiv.org/abs/2103.11489v1
Date: Sun, 21 Mar 2021 21:09:55 GMT
ステータス: 翻訳完了
システム内更新日: 2021-03-23 14:53:14.408936
Title: UCB-based Algorithms for Multinomial Logistic Regression Bandits
Title（参考訳）: UCBに基づく多項ロジスティック回帰帯域のアルゴリズム
Authors: Sanae Amani, Christos Thrampoulidis
Abstract要約: 我々は、MNL(Multinomial logit)を用いて、K+1geq 2$の可能な結果の確率をモデル化する。 MNL-UCBは, 問題依存定数に小さな依存を伴い, $tildemathcalO(dKsqrtT)$を後悔する, 上位信頼境界(UCB)に基づくアルゴリズムである。
参考スコア（独自算出の注目度）: 31.67685495996986
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Out of the rich family of generalized linear bandits, perhaps the most well studied ones are logisitc bandits that are used in problems with binary rewards: for instance, when the learner/agent tries to maximize the profit over a user that can select one of two possible outcomes (e.g., `click' vs `no-click'). Despite remarkable recent progress and improved algorithms for logistic bandits, existing works do not address practical situations where the number of outcomes that can be selected by the user is larger than two (e.g., `click', `show me later', `never show again', `no click'). In this paper, we study such an extension. We use multinomial logit (MNL) to model the probability of each one of $K+1\geq 2$ possible outcomes (+1 stands for the `not click' outcome): we assume that for a learner's action $\mathbf{x}_t$, the user selects one of $K+1\geq 2$ outcomes, say outcome $i$, with a multinomial logit (MNL) probabilistic model with corresponding unknown parameter $\bar{\boldsymbol\theta}_{\ast i}$. Each outcome $i$ is also associated with a revenue parameter $\rho_i$ and the goal is to maximize the expected revenue. For this problem, we present MNL-UCB, an upper confidence bound (UCB)-based algorithm, that achieves regret $\tilde{\mathcal{O}}(dK\sqrt{T})$ with small dependency on problem-dependent constants that can otherwise be arbitrarily large and lead to loose regret bounds. We present numerical simulations that corroborate our theoretical results.
Abstract（参考訳）: 一般化された線形包帯の豊かなファミリーのうち、おそらく最もよく研究されているものは、二進報酬の問題に使用される対数的包帯である:例えば、学習者/エージェントが2つの可能な結果のうちの1つを選択できるユーザに対して利益を最大化しようとする場合(例:「クリック」対「ノークリック」)。最近の顕著な進歩とロジスティック・バンディットのアルゴリズムの改善にもかかわらず、既存の作品は、ユーザーが選択できる結果の数が2より大きい(例えば、『click』、『show me later』、『never show again』、『no click』など)実用的な状況に対処していない。本稿では,そのような拡張について検討する。我々は、学習者のアクション $\mathbf{x}_t$ に対して、ユーザは$k+1\geq 2$ の結果の1つを選択し、その結果 $i$ を、対応する未知パラメータ $\bar{\boldsymbol\theta}_{\ast i}$ を持つマルチノミナルロジット (mnl) 確率モデルで選択する。それぞれの結果$i$は収益パラメータ$\rho_i$と関連付けられており、目標は期待される収益を最大化することである。この問題に対して、上信頼境界(UCB)に基づくアルゴリズムであるMNL-UCB(MNL-UCB)を、任意に大きい問題依存定数に小さな依存を伴って、後悔する$\tilde{\mathcal{O}}(dK\sqrt{T})$を達成する。理論的結果を裏付ける数値シミュレーションを提案する。

関連論文リスト

Enjoying Non-linearity in Multinomial Logistic Bandits [66.36004256710824]
我々は,学習者が環境と対話する一般化線形帯域幅の変種である多項ロジスティック帯域幅問題を考察する。我々は、$kappa_*$の定義を多項集合に拡張し、問題の非線形性を活用する効率的なアルゴリズムを提案する。我々のメソッドは、代入$ smashwidetildemathcalO(Kd sqrtT/kappa_*) $, $K$はアクションの数で、$kappa_*
論文参考訳（メタデータ） (2025-07-07T08:18:25Z)
Sample and Computationally Efficient Robust Learning of Gaussian Single-Index Models [37.42736399673992]
シングルインデックスモデル (SIM) は $sigma(mathbfwast cdot mathbfx)$ という形式の関数であり、$sigma: mathbbR to mathbbR$ は既知のリンク関数であり、$mathbfwast$ は隠れ単位ベクトルである。適切な学習者が$L2$-error of $O(mathrmOPT)+epsilon$。
論文参考訳（メタデータ） (2024-11-08T17:10:38Z)
Combinatorial Logistic Bandits [30.829239785016934]
我々はロジスティック・バンディット(CLogB)と呼ばれる新しいフレームワークを紹介する。各ラウンドでは、ベースアームのサブセット(スーパーアームと呼ばれる)が選択され、各ベースアームの結果はバイナリとなる。実世界のデータセットの実験では、ベンチマークアルゴリズムと比較してアルゴリズムの性能が優れていた。
論文参考訳（メタデータ） (2024-10-22T14:52:46Z)
Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation [67.8414514524356]
本稿では,MNL関数近似を用いたMDPの新しいクラスについて検討し,状態空間上の確率分布の正当性を保証する。非線型関数の導入は、計算効率と統計効率の両方において大きな課題を提起する。我々は,$mathcalO(1)$$コストで同じ後悔を実現するアルゴリズムを提案する。
論文参考訳（メタデータ） (2024-05-27T11:31:54Z)
No-Regret M${}^{\natural}$-Concave Function Maximization: Stochastic Bandit Algorithms and NP-Hardness of Adversarial Full-Information Setting [23.188831772813103]
オンラインM$natural$-concave関数問題について検討し,Murota と Shioura (1999) によるインタラクティブ版について検討した。バンドイット設定では、$O(T-1/2)$-simple regretと$O(T2/3)$-regretアルゴリズムを、M$natural$-concave関数のノイズ値オーラクルに$T$倍のアクセスで提示する。完全な情報フィードバックであっても,ラウンド毎に実行されたアルゴリズムは,任意の一定の$cに対して,O(T1-c)$後悔を達成できないことを証明しています。
論文参考訳（メタデータ） (2024-05-21T01:31:44Z)
Federated Linear Bandits with Finite Adversarial Actions [20.1041278044797]
我々は、M$のクライアントが中央サーバと通信し、線形文脈の帯域幅問題を解決するための連合線形帯域幅モデルについて検討する。逆有限作用集合のユニークな問題に対処するため、FedSupLinUCBアルゴリズムを提案する。我々は、FedSupLinUCBが$tildeO(sqrtd T)$の完全後悔を達成したことを証明している。
論文参考訳（メタデータ） (2023-11-02T03:41:58Z)
Distribution-Independent Regression for Generalized Linear Models with Oblivious Corruptions [49.69852011882769]
一般化線形モデル (GLMs) の重畳雑音の存在下での回帰問題に対する最初のアルゴリズムを示す。本稿では,この問題に最も一般的な分布非依存設定で対処するアルゴリズムを提案する。これは、サンプルの半分以上を任意に破損させる難聴ノイズを持つGLMレグレッションに対する最初の新しいアルゴリズムによる結果である。
論文参考訳（メタデータ） (2023-09-20T21:41:59Z)
Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
我々のアルゴリズムが $tildeObig((d+log (|mathcalS|2 |mathcalA|))sqrtKbig)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping is linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, $|mathcalS|$ and $|mathcalA|$ is the standardities of the state and action space。
論文参考訳（メタデータ） (2023-05-15T05:37:32Z)
Multinomial Logit Contextual Bandits: Provable Optimality and Practicality [15.533842336139063]
パラメータが不明な多項式ロギット(MNL)選択モデルによってユーザ選択が与えられる順序選択選択問題を検討する。本稿では,このMNLコンテクストバンディットに対する高信頼境界に基づくアルゴリズムを提案する。本稿では,アルゴリズムの単純な変種が,幅広い重要なアプリケーションに対して最適な後悔を与えることを示す。
論文参考訳（メタデータ） (2021-03-25T15:42:25Z)
Provably Breaking the Quadratic Error Compounding Barrier in Imitation Learning, Optimally [58.463668865380946]
状態空間 $mathcalS$ を用いたエピソードマルコフ決定過程 (MDPs) における模擬学習の統計的限界について検討する。 rajaraman et al (2020) におけるmdアルゴリズムを用いた準最適性に対する上限 $o(|mathcals|h3/2/n)$ を定式化する。 Omega(H3/2/N)$ $mathcalS|geq 3$ であるのに対して、未知の遷移条件はよりシャープレートに悩まされる。
論文参考訳（メタデータ） (2021-02-25T15:50:19Z)
Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity [59.34067736545355]
S$状態、$A$アクション、割引係数$gamma in (0,1)$、近似しきい値$epsilon > 0$の MDP が与えられた場合、$epsilon$-Optimal Policy を学ぶためのモデルなしアルゴリズムを提供する。十分小さな$epsilon$の場合、サンプルの複雑さで改良されたアルゴリズムを示す。
論文参考訳（メタデータ） (2020-06-06T13:34:41Z)
Taking a hint: How to leverage loss predictors in contextual bandits? [63.546913998407405]
我々は,損失予測の助けを借りて,文脈的包帯における学習を研究する。最適な後悔は$mathcalO(minsqrtT, sqrtmathcalETfrac13)$である。
論文参考訳（メタデータ） (2020-03-04T07:36:38Z)
Agnostic Q-learning with Function Approximation in Deterministic Systems: Tight Bounds on Approximation Error and Sample Complexity [94.37110094442136]
本稿では,決定論的システムにおける関数近似を用いたQ$学習の問題について検討する。もし$delta = Oleft(rho/sqrtdim_Eright)$なら、$Oleft(dim_Eright)$を使って最適なポリシーを見つけることができる。
論文参考訳（メタデータ） (2020-02-17T18:41:49Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。