Unimodal Mono-Partite Matching in a Bandit Setting
- URL: http://arxiv.org/abs/2208.01511v1
- Date: Tue, 2 Aug 2022 14:55:50 GMT
- Title: Unimodal Mono-Partite Matching in a Bandit Setting
- Authors: Romaric Gaudel (ENSAI, CREST), Matthieu Rodet (ENS Rennes)
- Abstract summary: We create an algorithm with an expected regret matching $O(\frac{L\log(L)}{\Delta}\log(T))$ with $2L$ players, $T$ iterations and a minimum reward gap $\Delta$.
Experimental results corroborate these theoretical findings in practice.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We tackle a new emerging problem, which is finding an optimal monopartite
matching in a weighted graph. The semi-bandit version, where a full matching is
sampled at each iteration, has been addressed by \cite{ADMA}, creating an
algorithm with an expected regret matching $O(\frac{L\log(L)}{\Delta}\log(T))$
with $2L$ players, $T$ iterations and a minimum reward gap $\Delta$. We reduce
this bound in two steps. First, as in \cite{GRAB} and \cite{UniRank} we use the
unimodality property of the expected reward on the appropriate graph to design
an algorithm with a regret in $O(L\frac{1}{\Delta}\log(T))$. Secondly, we show
that by moving the focus towards the main question `\emph{Is user $i$ better
than user $j$?}' this regret becomes
$O(L\frac{\Delta}{\tilde{\Delta}^2}\log(T))$, where $\tilde{\Delta} > \Delta$
derives from a better way of comparing users. Finally, experimental results
corroborate these theoretical findings in practice.
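The notion of cumulative regret used in these bounds can be illustrated with a toy simulation. The sketch below is not the paper's algorithm: it runs a naive optimistic baseline on a mono-partite matching instance with $2L = 4$ players under semi-bandit feedback (every edge of the sampled matching is observed), and tracks cumulative expected regret against the best matching. All names, the noise level, and the pair-reward model are illustrative assumptions.

```python
import itertools
import random

# Hypothetical toy setup: 2L = 4 players, unknown expected reward for each
# pair; the learner repeatedly picks a perfect matching on the players.
random.seed(0)
L = 2
players = list(range(2 * L))
true_mu = {frozenset(p): random.random()
           for p in itertools.combinations(players, 2)}

def all_matchings(nodes):
    """Enumerate all perfect matchings on an even-sized node list."""
    if not nodes:
        yield []
        return
    a = nodes[0]
    for i in range(1, len(nodes)):
        pair = frozenset((a, nodes[i]))
        rest = nodes[1:i] + nodes[i + 1:]
        for m in all_matchings(rest):
            yield [pair] + m

matchings = list(all_matchings(players))
best = max(sum(true_mu[p] for p in m) for m in matchings)

counts = {p: 0 for p in true_mu}
sums = {p: 0.0 for p in true_mu}
regret = 0.0
T = 2000
for t in range(T):
    # Optimistic follow-the-leader: unexplored edges get +inf, so every
    # edge is sampled at least once before estimates take over.
    est = {p: (sums[p] / counts[p] if counts[p] else float("inf"))
           for p in true_mu}
    m = max(matchings, key=lambda mm: sum(est[p] for p in mm))
    for p in m:  # semi-bandit feedback: observe each sampled edge
        r = true_mu[p] + random.gauss(0, 0.1)
        counts[p] += 1
        sums[p] += r
    # Regret accrues the gap between the best and the chosen matching.
    regret += best - sum(true_mu[p] for p in m)

print(round(regret, 2))
```

The quantities in the paper's bounds map onto this setup: $\Delta$ is the smallest gap between the best matching's value and any other matching's value, and the bounds describe how fast `regret` grows with $T$ under the proposed algorithms.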
Related papers
- Online Matrix Completion: A Collaborative Approach with Hott Items [19.781869063637387]
We investigate the low rank matrix completion problem in an online setting with $M$ users, $N$ items, $T$ rounds, and an unknown rank-$r$ reward matrix $R \in \mathbb{R}^{M \times N}$.
arXiv Detail & Related papers (2024-08-11T18:49:52Z) - UniRank: Unimodal Bandit Algorithm for Online Ranking [0.0]
We create an algorithm with an expected regret matching $O(\frac{L\log(L)}{\Delta}\log(T))$ with $2L$ players, $T$ iterations and a minimum reward gap $\Delta$.
Experimental results corroborate these theoretical findings in practice.
arXiv Detail & Related papers (2022-08-02T15:01:33Z) - Delaytron: Efficient Learning of Multiclass Classifiers with Delayed
Bandit Feedbacks [6.624726878647541]
We show that Adaptive Delaytron achieves a regret bound of $\mathcal{O}\left(\sqrt{\frac{2K}{\gamma}\left[\frac{T}{2}+\left(2+\frac{L^2}{R^2\Vert W\Vert_F^2}\right)\sum_{t=1}^{T} d_t\right]}\right)$.
arXiv Detail & Related papers (2022-05-17T11:12:20Z) - Logarithmic Regret from Sublinear Hints [76.87432703516942]
We show that an algorithm can obtain $O(\log T)$ regret with just $O(\sqrt{T})$ hints under a natural query model.
We also show that $o(\sqrt{T})$ hints cannot guarantee better than $\Omega(\sqrt{T})$ regret.
arXiv Detail & Related papers (2021-11-09T16:50:18Z) - Fast Graph Sampling for Short Video Summarization using Gershgorin Disc
Alignment [52.577757919003844]
We study the problem of efficiently summarizing a short video into several paragraphs, leveraging recent progress in fast graph sampling.
Experimental results show that our algorithm achieves comparable video summarization as state-of-the-art methods, at a substantially reduced complexity.
arXiv Detail & Related papers (2021-10-21T18:43:00Z) - Differentially Private Multi-Armed Bandits in the Shuffle Model [58.22098764071924]
We give an $(\varepsilon,\delta)$-differentially private algorithm for the multi-armed bandit (MAB) problem in the shuffle model.
Our upper bound almost matches the regret of the best known algorithms for the centralized model, and significantly outperforms the best known algorithm in the local model.
arXiv Detail & Related papers (2021-06-05T14:11:01Z) - Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
We study online learning with bandit feedback (i.e., the learner has access only to a zeroth-order oracle) where cost/reward functions admit a "pseudo-1d" structure.
We show a lower bound of $\min(\sqrt{dT}, T^{3/4})$ for the regret of any algorithm, where $T$ is the number of rounds.
We propose a new algorithm sbcalg that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
arXiv Detail & Related papers (2021-02-15T08:16:51Z) - $Q$-learning with Logarithmic Regret [60.24952657636464]
We prove that an optimistic $Q$-learning enjoys a $\mathcal{O}\left(\frac{SA\cdot \mathrm{poly}(H)}{\Delta_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap.
arXiv Detail & Related papers (2020-06-16T13:01:33Z) - Agnostic Q-learning with Function Approximation in Deterministic
Systems: Tight Bounds on Approximation Error and Sample Complexity [94.37110094442136]
We study the problem of agnostic $Q$-learning with function approximation in deterministic systems.
We show that if $\delta = O\left(\rho/\sqrt{\mathrm{dim}_E}\right)$, then one can find the optimal policy using $O\left(\mathrm{dim}_E\right)$.
arXiv Detail & Related papers (2020-02-17T18:41:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.