UniRank: Unimodal Bandit Algorithm for Online Ranking
- URL: http://arxiv.org/abs/2208.01515v1
- Date: Tue, 2 Aug 2022 15:01:33 GMT
- Title: UniRank: Unimodal Bandit Algorithm for Online Ranking
- Authors: Camille-Sovanneary Gauthier (LACODAM), Romaric Gaudel (ENSAI, CREST),
Elisa Fromont (LACODAM, IUF, UR1)
- Abstract summary: We create an algorithm with an expected regret matching $O(\frac{L\log(L)}{\Delta}\log(T))$ with $2L$ players, $T$ iterations and a minimum reward gap $\Delta$.
Experimental results corroborate these theoretical findings in practice.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We tackle an emerging problem: finding an optimal monopartite
matching in a weighted graph. The semi-bandit version, where a full matching is
sampled at each iteration, has been addressed by \cite{ADMA}, creating an
algorithm with an expected regret matching $O(\frac{L\log(L)}{\Delta}\log(T))$
with $2L$ players, $T$ iterations and a minimum reward gap $\Delta$. We reduce
this bound in two steps. First, as in \cite{GRAB} and \cite{UniRank}, we use the
unimodality property of the expected reward on the appropriate graph to design
an algorithm with a regret in $O(L\frac{1}{\Delta}\log(T))$. Second, we show
that by moving the focus towards the main question `\emph{Is user $i$ better
than user $j$?}', this regret becomes
$O(L\frac{\Delta}{\tilde{\Delta}^2}\log(T))$, where $\tilde{\Delta} > \Delta$
derives from a better way of comparing users. Finally, experimental results
corroborate these theoretical findings in practice.
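To make the unimodal idea concrete, here is a minimal, self-contained Python sketch of a semi-bandit matching loop in this spirit. It is an illustration, not the authors' algorithm: the Bernoulli edge model, the swap neighborhood, and the simple UCB index are all assumptions made for this sketch, standing in for the paper's pairwise-comparison statistics.

```python
# Illustrative sketch only -- NOT the paper's UniRank/ADMA algorithm.
import itertools
import math
import random

L = 3                       # number of pairs, i.e. 2L players in total
N = 2 * L
rng = random.Random(0)

# Hidden Bernoulli mean for every unordered pair of players (the weighted graph).
mu = {frozenset(e): rng.random() for e in itertools.combinations(range(N), 2)}

def neighbors(matching):
    """Matchings reachable by re-pairing two of the current pairs (one 'swap')."""
    out = []
    pairs = list(matching)
    for a, b in itertools.combinations(range(L), 2):
        (i, j), (k, l) = pairs[a], pairs[b]
        for pa, pb in (((i, k), (j, l)), ((i, l), (j, k))):
            m = pairs.copy()
            m[a], m[b] = pa, pb
            out.append(tuple(m))
    return out

counts = {e: 0 for e in mu}   # times each edge was played
sums = {e: 0.0 for e in mu}   # cumulative reward observed on each edge

def index(matching, t):
    """Optimistic value of a matching from per-edge statistics
    (semi-bandit feedback: every edge of the played matching is observed)."""
    val = 0.0
    for pair in matching:
        e = frozenset(pair)
        if counts[e] == 0:
            return float("inf")   # unexplored edge: force exploration
        val += sums[e] / counts[e] + math.sqrt(2.0 * math.log(t + 1) / counts[e])
    return val

current = tuple((2 * p, 2 * p + 1) for p in range(L))   # arbitrary initial matching
for t in range(5000):
    # Unimodality is what justifies restricting the search to the current
    # matching and its swap neighbors rather than all possible matchings.
    candidate = max([current] + neighbors(current), key=lambda m: index(m, t))
    for pair in candidate:    # play the matching, observe each edge's reward
        e = frozenset(pair)
        counts[e] += 1
        sums[e] += 1.0 if rng.random() < mu[e] else 0.0
    current = candidate

print("matching found:", current)
```

The key point mirrors the abstract: thanks to unimodality, each round compares the current matching only against its $O(L^2)$ swap neighbors instead of searching all $(2L-1)!!$ perfect matchings on $2L$ players.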
Related papers
- Online Matrix Completion: A Collaborative Approach with Hott Items [19.781869063637387]
We investigate the low-rank matrix completion problem in an online setting with $M$ users, $N$ items, $T$ rounds, and an unknown rank-$r$ reward matrix $R \in \mathbb{R}^{M \times N}$.
arXiv Detail & Related papers (2024-08-11T18:49:52Z)
- Unimodal Mono-Partite Matching in a Bandit Setting [0.0]
We create an algorithm with an expected regret matching $O(\frac{L\log(L)}{\Delta}\log(T))$ with $2L$ players, $T$ iterations and a minimum reward gap $\Delta$.
Experimental results corroborate these theoretical findings in practice.
arXiv Detail & Related papers (2022-08-02T14:55:50Z)
- Exploiting Correlation to Achieve Faster Learning Rates in Low-Rank Preference Bandits [21.70169149901781]
We investigate whether models with a simple correlation structure, e.g. low rank, can result in faster learning rates.
In particular, we introduce a new class of \emph{Block-Rank} based RUM models, where the best item is shown to be $(\epsilon,\delta)$-PAC learnable with only $O(r \epsilon^{-2} \log(n/\delta))$ samples.
arXiv Detail & Related papers (2022-02-23T21:34:20Z)
- Logarithmic Regret from Sublinear Hints [76.87432703516942]
We show that an algorithm can obtain $O(\log T)$ regret with just $O(\sqrt{T})$ hints under a natural query model.
We also show that $o(\sqrt{T})$ hints cannot guarantee better than $\Omega(\sqrt{T})$ regret.
arXiv Detail & Related papers (2021-11-09T16:50:18Z)
- Contextual Recommendations and Low-Regret Cutting-Plane Algorithms [49.91214213074933]
We consider the following variant of contextual linear bandits motivated by routing applications in navigational engines and recommendation systems.
We design novel cutting-plane algorithms with low "regret" -- the total distance between the true point $w^*$ and the hyperplanes the separation oracle returns.
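As a toy illustration of this distance-based notion of regret (not code from the paper; the point and hyperplanes below are invented for the sketch), one can sum point-to-hyperplane distances:

```python
import math

# Hypothetical data: a true point w* and hyperplanes a . x = b returned by a
# separation oracle over three rounds (all values made up for illustration).
w_star = [0.6, 0.8]
hyperplanes = [([1.0, 0.0], 0.5),
               ([0.0, 1.0], 0.7),
               ([1.0, 1.0], 1.3)]

def distance(w, a, b):
    """Euclidean distance from the point w to the hyperplane a . x = b."""
    dot = sum(ai * wi for ai, wi in zip(a, w))
    return abs(dot - b) / math.sqrt(sum(ai * ai for ai in a))

# The quoted "regret" sums these distances over the oracle's answers.
regret = sum(distance(w_star, a, b) for a, b in hyperplanes)
print(f"total distance (regret): {regret:.3f}")
```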
arXiv Detail & Related papers (2021-06-09T05:39:05Z)
- Differentially Private Multi-Armed Bandits in the Shuffle Model [58.22098764071924]
We give an $(\varepsilon,\delta)$-differentially private algorithm for the multi-armed bandit (MAB) problem in the shuffle model.
Our upper bound almost matches the regret of the best known algorithms for the centralized model, and significantly outperforms the best known algorithm in the local model.
arXiv Detail & Related papers (2021-06-05T14:11:01Z)
- Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
We study online learning with bandit feedback (i.e. learner has access to only zeroth-order oracle) where cost/reward functions admit a "pseudo-1d" structure.
We show a lower bound of $\min(\sqrt{dT}, T^{3/4})$ for the regret of any algorithm, where $T$ is the number of rounds.
We propose a new algorithm sbcalg that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
arXiv Detail & Related papers (2021-02-15T08:16:51Z)
- $Q$-learning with Logarithmic Regret [60.24952657636464]
We prove that an optimistic $Q$-learning algorithm enjoys a $\mathcal{O}\left(\frac{SA \cdot \mathrm{poly}(H)}{\Delta_{\min}} \log(SAT)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap.
arXiv Detail & Related papers (2020-06-16T13:01:33Z)
- Contextual Bandits with Side-Observations [10.248045133793287]
We investigate contextual bandits in the presence of side-observations across arms in order to design recommendation algorithms for users connected via social networks.
We show that a naive application of existing learning algorithms results in $O(N \log T)$ regret, where $N$ is the number of users.
arXiv Detail & Related papers (2020-06-06T19:34:50Z)
- Agnostic Q-learning with Function Approximation in Deterministic Systems: Tight Bounds on Approximation Error and Sample Complexity [94.37110094442136]
We study the problem of agnostic $Q$-learning with function approximation in deterministic systems.
We show that if $\delta = O\left(\rho/\sqrt{\dim_E}\right)$, then one can find the optimal policy using $O\left(\dim_E\right)$ samples.
arXiv Detail & Related papers (2020-02-17T18:41:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.