Differentially Private Multi-Armed Bandits in the Shuffle Model
- URL: http://arxiv.org/abs/2106.02900v2
- Date: Tue, 8 Jun 2021 03:41:22 GMT
- Title: Differentially Private Multi-Armed Bandits in the Shuffle Model
- Authors: Jay Tenenbaum, Haim Kaplan, Yishay Mansour, Uri Stemmer
- Abstract summary: We give an $(\varepsilon,\delta)$-differentially private algorithm for the multi-armed bandit (MAB) problem in the shuffle model.
Our upper bound almost matches the regret of the best known algorithms for the centralized model, and significantly outperforms the best known algorithm in the local model.
- Score: 58.22098764071924
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We give an $(\varepsilon,\delta)$-differentially private algorithm for the multi-armed bandit (MAB) problem in the shuffle model with a distribution-dependent regret of $O\left(\left(\sum_{a\in[k]:\Delta_a>0}\frac{\log T}{\Delta_a}\right)+\frac{k\sqrt{\log\frac{1}{\delta}}\log T}{\varepsilon}\right)$, and a distribution-independent regret of $O\left(\sqrt{kT\log T}+\frac{k\sqrt{\log\frac{1}{\delta}}\log T}{\varepsilon}\right)$, where $T$ is the number of rounds, $\Delta_a$ is the suboptimality gap of the arm $a$, and $k$ is the total number of arms. Our upper bound almost matches the regret of the best known algorithms for the centralized model, and significantly outperforms the best known algorithm in the local model.
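In the shuffle model, each user applies only a lightweight local randomizer to their reward, a trusted shuffler permutes the resulting reports, and the analyzer only sees the anonymized batch; privacy amplification by shuffling is what lets the regret approach the centralized rate above. As a minimal illustrative sketch (not the authors' algorithm), the Python below privatizes each batch of binary rewards with randomized response, shuffles and debiases the reports, and feeds the estimates into a standard UCB rule. The names (`shuffle_ucb`, `p_flip`, `batch`) and the fixed flip probability are hypothetical placeholders; a faithful implementation would calibrate the randomizer to the target $(\varepsilon,\delta)$ via amplification-by-shuffling bounds and use the paper's batching schedule.

```python
import math
import random

def randomized_response(bit: int, p_flip: float) -> int:
    """Local randomizer: each user flips their binary reward w.p. p_flip."""
    return 1 - bit if random.random() < p_flip else bit

def shuffle_and_aggregate(rewards, p_flip):
    """Shuffler + analyzer: permute the reports (hiding which user sent
    which), then debias the sum. Each report has mean
    bit*(1 - 2*p_flip) + p_flip, so the sum is debiased accordingly."""
    reports = [randomized_response(r, p_flip) for r in rewards]
    random.shuffle(reports)  # the trusted shuffler's only role
    return (sum(reports) - len(reports) * p_flip) / (1.0 - 2.0 * p_flip)

def shuffle_ucb(pull, k, T, p_flip=0.1, batch=50):
    """Batched UCB on binary rewards where each batch reaches the learner
    only through the shuffled, randomized reports."""
    counts, sums = [0] * k, [0.0] * k

    def ucb(a):
        if counts[a] == 0:
            return float("inf")  # play every arm at least once
        return sums[a] / counts[a] + math.sqrt(2 * math.log(T) / counts[a])

    for _ in range(T // batch):
        a = max(range(k), key=ucb)                 # optimistic arm choice
        rewards = [pull(a) for _ in range(batch)]  # one user per reward
        sums[a] += shuffle_and_aggregate(rewards, p_flip)
        counts[a] += batch
    return counts

# Example: three Bernoulli arms; pulls should concentrate on mean 0.9.
means = [0.9, 0.5, 0.4]
print(shuffle_ucb(lambda a: int(random.random() < means[a]), k=3, T=5000))
```

The debiased estimates are unbiased but have inflated variance relative to the raw rewards, which is roughly where the additive $\frac{k\sqrt{\log\frac{1}{\delta}}\log T}{\varepsilon}$ privacy term in the bounds above comes from.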
Related papers
- Optimal Streaming Algorithms for Multi-Armed Bandits [28.579280943038555]
We study the streaming best-arm identification (BAI) problem, where the objective is to identify the arm with the maximum reward mean with probability at least $1-\delta$.
We present a single-arm-memory algorithm that achieves near-optimal instance-dependent sample complexity within $O(\log \Delta^{-1})$ passes.
arXiv Detail & Related papers (2024-10-23T12:54:04Z)
- Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches.
We design a computationally efficient algorithm that achieves near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ over $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic factors in $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z)
- Variance-Dependent Best Arm Identification [12.058652922228918]
We study the problem of identifying the best arm in a multi-armed bandit game.
We propose an adaptive algorithm which explores the gaps and variances of the rewards of the arms.
We show that our algorithm is optimal up to doubly logarithmic terms.
arXiv Detail & Related papers (2021-06-19T04:13:54Z)
- Bandits with many optimal arms [68.17472536610859]
We write $p^*$ for the proportion of optimal arms and $\Delta$ for the minimal mean-gap between optimal and sub-optimal arms.
We characterize the optimal learning rates both in the cumulative regret setting, and in the best-arm identification setting.
arXiv Detail & Related papers (2021-03-23T11:02:31Z)
- Private Stochastic Convex Optimization: Optimal Rates in $\ell_1$ Geometry [69.24618367447101]
Up to logarithmic factors, the optimal excess population loss of any $(\varepsilon,\delta)$-differentially private algorithm is $\sqrt{\log(d)/n} + \sqrt{d}/(\varepsilon n)$.
We show that when the loss functions satisfy additional smoothness assumptions, the excess loss is upper bounded (up to logarithmic factors) by $\sqrt{\log(d)/n} + (\log(d)/(\varepsilon n))^{2/3}$.
arXiv Detail & Related papers (2021-03-02T06:53:44Z)
- Near-Optimal Algorithms for Differentially Private Online Learning in a Stochastic Environment [7.4288915613206505]
We study differentially private online learning problems in a stochastic environment under both bandit and full-information feedback.
For differentially private bandits, we propose both UCB- and Thompson-Sampling-based algorithms that are anytime and achieve the optimal $O\left(\sum_{j:\Delta_j>0}\frac{\ln(T)}{\min\{\Delta_j,\epsilon\}}\right)$ regret, matching the minimax lower bound.
For the same differentially private full-information setting, we also present an $\epsilon$-differentially private algorithm.
arXiv Detail & Related papers (2021-02-16T02:48:16Z)
- Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
We study online learning with bandit feedback (i.e., the learner has access only to a zeroth-order oracle) where the cost/reward functions admit a "pseudo-1d" structure.
We show a lower bound of $\min(\sqrt{dT}, T^{3/4})$ on the regret of any algorithm, where $T$ is the number of rounds.
We propose a new algorithm that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
arXiv Detail & Related papers (2021-02-15T08:16:51Z)
- $Q$-learning with Logarithmic Regret [60.24952657636464]
We prove that an optimistic variant of $Q$-learning enjoys a $\mathcal{O}\left(\frac{SA\cdot \mathrm{poly}(H)}{\Delta_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap.
arXiv Detail & Related papers (2020-06-16T13:01:33Z)