UCB-based Algorithms for Multinomial Logistic Regression Bandits
- URL: http://arxiv.org/abs/2103.11489v1
- Date: Sun, 21 Mar 2021 21:09:55 GMT
- Title: UCB-based Algorithms for Multinomial Logistic Regression Bandits
- Authors: Sanae Amani, Christos Thrampoulidis
- Abstract summary: We use multinomial logit (MNL) to model the probability of each one of $K+1 \geq 2$ possible outcomes.
We present MNL-UCB, an upper confidence bound (UCB)-based algorithm, that achieves regret $\tilde{\mathcal{O}}(dK\sqrt{T})$ with small dependency on problem-dependent constants.
- Score: 31.67685495996986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Out of the rich family of generalized linear bandits, perhaps the most well
studied ones are logistic bandits that are used in problems with binary
rewards: for instance, when the learner/agent tries to maximize the profit over
a user that can select one of two possible outcomes (e.g., `click' vs
`no-click'). Despite remarkable recent progress and improved algorithms for
logistic bandits, existing works do not address practical situations where the
number of outcomes that can be selected by the user is larger than two (e.g.,
`click', `show me later', `never show again', `no click'). In this paper, we
study such an extension. We use multinomial logit (MNL) to model the
probability of each one of $K+1\geq 2$ possible outcomes (+1 stands for the
`not click' outcome): we assume that for a learner's action $\mathbf{x}_t$, the
user selects one of $K+1\geq 2$ outcomes, say outcome $i$, with a multinomial
logit (MNL) probabilistic model with corresponding unknown parameter
$\bar{\boldsymbol\theta}_{\ast i}$. Each outcome $i$ is also associated with a
revenue parameter $\rho_i$ and the goal is to maximize the expected revenue.
For this problem, we present MNL-UCB, an upper confidence bound (UCB)-based
algorithm, that achieves regret $\tilde{\mathcal{O}}(dK\sqrt{T})$ with small
dependency on problem-dependent constants that can otherwise be arbitrarily
large and lead to loose regret bounds. We present numerical simulations that
corroborate our theoretical results.
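To make the model concrete, below is a minimal sketch (in Python/NumPy) of the MNL outcome probabilities and the expected revenue the learner seeks to maximize. It assumes the standard softmax parametrization with the `no click' logit fixed to zero; all names (mnl_probs, expected_revenue, theta_star, rho) are illustrative and not taken from the paper's code.

```python
import numpy as np

def mnl_probs(x, theta):
    """Probability of each of the K+1 outcomes for action x (a sketch).

    x     : (d,) feature vector of the learner's action
    theta : (K, d) outcome parameters, one row per "click-type" outcome;
            the (K+1)-th outcome ('no click') has its logit fixed to 0.
    """
    logits = np.append(theta @ x, 0.0)   # (K+1,) with the 'no click' logit = 0
    logits -= logits.max()               # shift for numerical stability
    expw = np.exp(logits)
    return expw / expw.sum()             # softmax over the K+1 outcomes

def expected_revenue(x, theta, rho):
    """Expected revenue of action x; rho[i] is the revenue of outcome i."""
    return float(mnl_probs(x, theta) @ rho)

# Example usage with random problem data (theta_star is unknown to the learner
# in the bandit setting; here it is drawn only to illustrate the model).
rng = np.random.default_rng(0)
d, K = 5, 3
theta_star = rng.normal(size=(K, d))
rho = np.array([1.0, 0.5, 0.2, 0.0])     # last entry: revenue of 'no click'
x = rng.normal(size=d)
print(expected_revenue(x, theta_star, rho))
```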
Related papers
- Sample and Computationally Efficient Robust Learning of Gaussian Single-Index Models [37.42736399673992]
A single-index model (SIM) is a function of the form $\sigma(\mathbf{w}^\ast \cdot \mathbf{x})$, where $\sigma: \mathbb{R} \to \mathbb{R}$ is a known link function and $\mathbf{w}^\ast$ is a hidden unit vector.
We show that a proper learner attains $L_2$-error of $O(\mathrm{OPT})+\epsilon$.
arXiv Detail & Related papers (2024-11-08T17:10:38Z) - Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation [67.8414514524356]
We study a new class of MDPs that employs multinomial logit (MNL) function approximation to ensure valid probability distributions over the state space.
Despite its significant benefits, incorporating the non-linear function raises substantial challenges in both statistical and computational efficiency.
We propose an algorithm that achieves a regret of $\widetilde{\mathcal{O}}(dH^2\sqrt{K} + \kappa^{-1}d^2H^2)$, eliminating the dependence on $\kappa^{-1}$ in the dominant term for the first time.
arXiv Detail & Related papers (2024-05-27T11:31:54Z) - No-Regret M${}^{\natural}$-Concave Function Maximization: Stochastic Bandit Algorithms and NP-Hardness of Adversarial Full-Information Setting [23.188831772813103]
We study online M$^\natural$-concave function maximization problems, which are interactive versions of the problem studied by Murota and Shioura (1999).
For the bandit setting, we present $O(T^{-1/2})$-simple regret and $O(T^{2/3})$-regret algorithms under $T$ times access to noisy value oracles of M$^\natural$-concave functions.
We prove that even with full-information feedback, no algorithms that run in polynomial time per round can achieve $O(T^{1-c})$ regret for any constant $c > 0$.
arXiv Detail & Related papers (2024-05-21T01:31:44Z) - Federated Linear Bandits with Finite Adversarial Actions [20.1041278044797]
We study a federated linear bandits model, where $M$ clients communicate with a central server to solve a linear contextual bandits problem.
To address the unique challenges of adversarial finite action sets, we propose the FedSupLinUCB algorithm.
We prove that FedSupLinUCB achieves a total regret of $\tilde{O}(\sqrt{dT})$, where $T$ is the total number of arm pulls from all clients, and $d$ is the ambient dimension of the linear model.
arXiv Detail & Related papers (2023-11-02T03:41:58Z) - Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d+\log(|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - Multinomial Logit Contextual Bandits: Provable Optimality and
Practicality [15.533842336139063]
We consider a sequential assortment selection problem where the user choice is given by a multinomial logit (MNL) choice model whose parameters are unknown.
We propose upper confidence bound based algorithms for this MNL contextual bandit.
We show that a simple variant of the algorithm achieves the optimal regret for a broad class of important applications.
arXiv Detail & Related papers (2021-03-25T15:42:25Z) - Provably Breaking the Quadratic Error Compounding Barrier in Imitation
Learning, Optimally [58.463668865380946]
We study the statistical limits of Imitation Learning in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$.
We establish an upper bound $O(|\mathcal{S}|H^{3/2}/N)$ for the suboptimality using the Mimic-MD algorithm in Rajaraman et al. (2020).
We show the minimax suboptimality grows as $\Omega(H^{3/2}/N)$ when $|\mathcal{S}| \geq 3$, while the unknown-transition setting suffers from a larger sharp rate.
arXiv Detail & Related papers (2021-02-25T15:50:19Z) - Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample
Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For small enough $\epsilon$, we show an algorithm with improved sample complexity.
arXiv Detail & Related papers (2020-06-06T13:34:41Z) - Taking a hint: How to leverage loss predictors in contextual bandits? [63.546913998407405]
We study learning in contextual bandits with the help of loss predictors.
We show that the optimal regret is $\mathcal{O}(\min\{\sqrt{T}, \sqrt{\mathcal{E}}\,T^{\frac{1}{3}}\})$ when $\mathcal{E}$ is known.
arXiv Detail & Related papers (2020-03-04T07:36:38Z) - Agnostic Q-learning with Function Approximation in Deterministic
Systems: Tight Bounds on Approximation Error and Sample Complexity [94.37110094442136]
We study the problem of agnostic $Q$-learning with function approximation in deterministic systems.
We show that if $\delta = O\left(\rho/\sqrt{\dim_E}\right)$, then one can find the optimal policy using $O\left(\dim_E\right)$.
arXiv Detail & Related papers (2020-02-17T18:41:49Z)