Fugu-MT 論文翻訳(概要): Bandit Learning to Rank with Position-Based Click Models: Personalized and Equal Treatments

論文の概要: Bandit Learning to Rank with Position-Based Click Models: Personalized and Equal Treatments

arxiv url: http://arxiv.org/abs/2311.04528v1
Date: Wed, 8 Nov 2023 08:33:03 GMT
ステータス: 翻訳完了
システム内更新日: 2023-11-09 16:42:07.092164
Title: Bandit Learning to Rank with Position-Based Click Models: Personalized and Equal Treatments
Title（参考訳）: 位置ベースクリックモデルでランク付けするバンド学習:パーソナライズと平等な治療
Authors: Tianchen Zhou, Jia Liu, Yang Jiao, Chaosheng Dong, Yetian Chen, Yan Gao, Yi Sun
Abstract要約: We developed two unified greed- and UCB-based policy called GreedyRank and UCBRank。 GreedyRank と UCBRank はともに,パーソナライズと平等な治療のために,いつでも$O(sqrttln t) と $O(sqrttln t) のサブライン後悔を享受している。
参考スコア（独自算出の注目度）: 26.59975967048243
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Online learning to rank (ONL2R) is a foundational problem for recommender systems and has received increasing attention in recent years. Among the existing approaches for ONL2R, a natural modeling architecture is the multi-armed bandit framework coupled with the position-based click model. However, developing efficient online learning policies for MAB-based ONL2R with position-based click models is highly challenging due to the combinatorial nature of the problem, and partial observability in the position-based click model. To date, results in MAB-based ONL2R with position-based click models remain rather limited, which motivates us to fill this gap in this work. Our main contributions in this work are threefold: i) We propose the first general MAB framework that captures all key ingredients of ONL2R with position-based click models. Our model considers personalized and equal treatments in ONL2R ranking recommendations, both of which are widely used in practice; ii) Based on the above analytical framework, we develop two unified greed- and UCB-based policies called GreedyRank and UCBRank, each of which can be applied to personalized and equal ranking treatments; and iii) We show that both GreedyRank and UCBRank enjoy $O(\sqrt{t}\ln t)$ and $O(\sqrt{t\ln t})$ anytime sublinear regret for personalized and equal treatment, respectively. For the fundamentally hard equal ranking treatment, we identify classes of collective utility functions and their associated sufficient conditions under which $O(\sqrt{t}\ln t)$ and $O(\sqrt{t\ln t})$ anytime sublinear regrets are still achievable for GreedyRank and UCBRank, respectively. Our numerical experiments also verify our theoretical results and demonstrate the efficiency of GreedyRank and UCBRank in seeking the optimal action under various problem settings.
Abstract（参考訳）: online learning to rank (onl2r) はレコメンダシステムの基礎的な問題であり、近年注目を集めている。 ONL2Rの既存のアプローチの中で、自然モデリングアーキテクチャは、位置ベースクリックモデルと組み合わせたマルチアームバンディットフレームワークである。しかし、位置ベースクリックモデルを用いたMABベースのONL2Rの効率的なオンライン学習ポリシーの開発は、問題の組合せの性質と位置ベースクリックモデルにおける部分的可観測性により、非常に困難である。現在までに、位置ベースのクリックモデルを持つMABベースのONL2Rの結果はかなり限定的であり、この作業のギャップを埋める動機となっている。この仕事の主な貢献は3つあります。 i) 位置ベースクリックモデルを用いてONL2Rのすべての重要な成分をキャプチャする最初の汎用MABフレームワークを提案する。本モデルでは,ONL2Rの格付け勧告におけるパーソナライズと同等の扱いについて検討する。二上記の分析枠組みに基づき、GreedyRank とUCBRank という2つの統合されたgreed- and UCB-based Policyを策定し、それぞれがパーソナライズ及び同等のランキング処理に適用することができる。 iii) greedyrank と ucbrank がそれぞれ$o(\sqrt{t}\ln t)$ と $o(\sqrt{t\ln t})$ を享受していることを示す。根本的な難解な等ランク処理では、集合的ユーティリティ関数のクラスとそれに関連する十分条件を識別し、そこでは$o(\sqrt{t}\ln t)$ と $o(\sqrt{t\ln t})$ がそれぞれグリーディランクとucbrankに対して依然として達成可能である。また, 数値実験により, 種々の問題条件下での最適行動を求める上でのGreedyRankとUCBRankの有効性を検証した。

関連論文リスト

MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [59.536850459059856]
MM-RLHF, $mathbf120k$ fine-fine, human-annotated preference comparison pairsを含むデータセットを紹介する。本稿では,報酬モデルの品質向上とアライメントアルゴリズムの効率向上のために,いくつかの重要なイノベーションを提案する。我々のアプローチは、$mathbf10$の異なる次元と$mathbf27$のベンチマークで厳格に評価されている。
論文参考訳（メタデータ） (2025-02-14T18:59:51Z)
Adaptively Learning to Select-Rank in Online Platforms [34.258659206323664]
本研究は、異種ユーザの候補プールからアイテムを適応的にランク付けすることの課題に対処する。本研究では,多様なユーザの好みや項目位置の影響を考慮に入れたユーザ応答モデルを構築した。シミュレーションと実世界の両方のデータセットで実施された実験は、アルゴリズムがベースラインを上回っていることを示している。
論文参考訳（メタデータ） (2024-06-07T15:33:48Z)
Towards Robust Model-Based Reinforcement Learning Against Adversarial Corruption [60.958746600254884]
本研究は、モデルベース強化学習(RL)における敵対的腐敗の課題に取り組む。本稿では,MLE に対する不確実性重みとして全変量 (TV) に基づく情報比を利用する,汚損楽観的 MLE (CR-OMLE) アルゴリズムを提案する。我々は、重み付け手法をオフライン設定にまで拡張し、汚損性悲観的MLE (CR-PMLE) というアルゴリズムを提案する。
論文参考訳（メタデータ） (2024-02-14T07:27:30Z)
Simultaneously Learning Stochastic and Adversarial Bandits under the Position-Based Model [9.945948163150874]
本研究は, 位置ベースモデルに基づくオンライン学習における課題のランク付けに関する研究である。提案アルゴリズムは,対向環境において$O(logT)$後悔を同時に達成し,対向環境において$O(msqrtnT)$後悔を同時に達成する。実験により,本アルゴリズムは,既存手法と競合する環境下で同時に学習できることが確認された。
論文参考訳（メタデータ） (2022-07-12T10:00:14Z)
Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
本研究は,RL(Human-in-the-loop reinforcement learning)を軌道的嗜好で検討する。各ステップで数値的な報酬を受ける代わりに、エージェントは人間の監督者から軌道上のペアよりも優先される。一般関数近似を用いたPbRLの楽観的モデルベースアルゴリズムを提案する。
論文参考訳（メタデータ） (2022-05-23T09:03:24Z)
Learning-To-Ensemble by Contextual Rank Aggregation in E-Commerce [8.067201256886733]
本稿では,アンサンブルモデルを文脈的ランクアグリゲータに置き換えた新しいラーニング・トゥ・エンサンブル・フレームワークRAEGOを提案する。 RA-EGOは当社のオンラインシステムにデプロイされ、収益を大幅に改善しました。
論文参考訳（メタデータ） (2021-07-19T03:24:06Z)
Online Apprenticeship Learning [58.45089581278177]
見習い学習(AL)では、コスト関数にアクセスせずにマルコフ決定プロセス(MDP)が与えられます。目標は、事前に定義されたコスト関数のセットで専門家のパフォーマンスに一致するポリシーを見つけることです。ミラー下降型ノンレグレットアルゴリズムを2つ組み合わせることで,OAL問題を効果的に解くことができることを示す。
論文参考訳（メタデータ） (2021-02-13T12:57:51Z)
Efficient Online Learning of Optimal Rankings: Dimensionality Reduction via Gradient Descent [47.66497212729108]
一般化されたMin-Sum-Set-Cover問題は上記の設定の形式モデルとして機能する。 GMSSCインタイムでの後悔度を低くする方法を示す。
論文参考訳（メタデータ） (2020-11-05T13:52:34Z)
Learning to Rank under Multinomial Logit Choice [6.929312022493406]
コンテンツの最適順序付けを学ぶことは、ウェブサイト設計において重要な課題である。本稿では,この問題に対する$Omega(sqrtJT)$lowbound,$tildeO(sqrtJT)$ upperbound on the regret of the UCBアルゴリズムを提案する。
論文参考訳（メタデータ） (2020-09-07T16:15:12Z)
Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
モデルに基づくMARLは、Nash平衡値(NE)を求めるために$tilde O(|S||B|(gamma)-3epsilon-2)$のサンプル複雑性を実現する。また、アルゴリズムが報酬に依存しない場合、そのようなサンプル境界は最小値(対数因子まで)であり、アルゴリズムは報酬知識のない遷移サンプルを問合せする。
論文参考訳（メタデータ） (2020-07-15T03:25:24Z)
Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition [59.34067736545355]
有限水平型マルコフ決定過程(MDP)における強化学習問題を,S$状態,A$動作,エピソード長$H$を用いて検討した。モデルフリーアルゴリズム UCB-Advantage を提案し、$T = KH$ および $K$ が再生すべきエピソード数である場合に $tildeO(sqrtH2SAT)$ regret を達成することを証明した。
論文参考訳（メタデータ） (2020-04-21T14:00:06Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。