(Locally) Differentially Private Combinatorial Semi-Bandits
- URL: http://arxiv.org/abs/2006.00706v2
- Date: Fri, 3 Jul 2020 10:52:42 GMT
- Title: (Locally) Differentially Private Combinatorial Semi-Bandits
- Authors: Xiaoyu Chen, Kai Zheng, Zixin Zhou, Yunchang Yang, Wei Chen, Liwei
Wang
- Abstract summary: We study Combinatorial Semi-Bandits (CSB), an extension of the classic Multi-Armed Bandits (MAB) problem, under the Differential Privacy (DP) and the stronger Local Differential Privacy (LDP) settings.
- Score: 26.462686161971803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study Combinatorial Semi-Bandits (CSB), an extension
of the classic Multi-Armed Bandits (MAB) problem, under the Differential Privacy (DP)
and the stronger Local Differential Privacy (LDP) settings. Since the server receives
more information from users in CSB than in MAB, privacy protection usually incurs an
additional dependence on the dimension of the data, a notorious side-effect of
privacy-preserving learning. However, for CSB under two common smoothness assumptions
\cite{kveton2015tight,chen2016combinatorial}, we show it is possible to remove
this side-effect. In detail, for $B_{\infty}$-bounded smooth CSB under either
$\varepsilon$-LDP or $\varepsilon$-DP, we prove the optimal regret bound is
$\Theta(\frac{mB^2_{\infty}\ln T}{\Delta\varepsilon^2})$ or
$\tilde{\Theta}(\frac{mB^2_{\infty}\ln T}{\Delta\varepsilon})$ respectively,
where $T$ is the time horizon, $\Delta$ is the reward gap, and $m$ is the number
of base arms, by proposing novel algorithms and matching lower bounds. For
$B_1$-bounded smooth CSB under $\varepsilon$-DP, we also prove the optimal
regret bound is $\tilde{\Theta}(\frac{mKB^2_1\ln T}{\Delta\varepsilon})$, with both a
matching upper bound and lower bound, where $K$ is the maximum number of feedback
signals in each round. All of the above results nearly match the corresponding
non-private optimal rates, which implies there is essentially no additional price for
(locally) differentially private CSB in these common settings.
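To make the LDP feedback model concrete, here is a minimal simulation sketch. It is not the paper's algorithm; all parameters (`m`, `K`, `eps`) and the inflated confidence width are illustrative assumptions. Each round the user perturbs the semi-bandit feedback with the Laplace mechanism before the server sees it, and a CUCB-style learner runs on the noisy statistics:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions for this sketch, not from the paper):
m, K, eps, T = 10, 3, 1.0, 20_000   # base arms, super-arm size, LDP budget, horizon
mu = rng.uniform(0.1, 0.9, size=m)  # unknown Bernoulli means of the base arms
best = np.sort(mu)[-K:].sum()       # value of the optimal super arm (top-K here)

counts = np.zeros(m)  # server-side pull counts
sums = np.zeros(m)    # server-side sums of *privatized* rewards
regret = 0.0

def privatize(rewards):
    """Laplace mechanism: the round's K observed rewards lie in [0,1]^K, so the
    L1 sensitivity of the feedback vector is K, and per-coordinate
    Laplace(K/eps) noise makes the whole report eps-LDP."""
    return rewards + rng.laplace(scale=K / eps, size=rewards.shape)

for t in range(1, T + 1):
    mean_hat = sums / np.maximum(counts, 1)
    # Heuristic confidence width, inflated by the Laplace noise scale.
    width = (1 + K / eps) * np.sqrt(3 * np.log(t + 1) / np.maximum(counts, 1))
    ucb = np.where(counts > 0, mean_hat + width, np.inf)
    S = np.argsort(ucb)[-K:]                    # greedy oracle: play top-K arms

    raw = rng.binomial(1, mu[S]).astype(float)  # user observes true rewards...
    counts[S] += 1
    sums[S] += privatize(raw)                   # ...but only sends noisy ones
    regret += best - mu[S].sum()

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

The noise is mean-zero, so the server-side estimates stay unbiased; the price of privacy appears only through the wider confidence intervals, mirroring the $1/\varepsilon^2$ blow-up in the LDP regret bound above.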
Related papers
- DP-Dueling: Learning from Preference Feedback without Compromising User Privacy [32.58099924135157]
We give the first differentially private dueling bandit algorithm for active learning with user preferences.
Our algorithms are computationally efficient with near-optimal performance.
We extend our results to any general decision space in $d$ dimensions with potentially infinite arms.
arXiv Detail & Related papers (2024-03-22T09:02:12Z) - Near Sample-Optimal Reduction-based Policy Learning for Average Reward
MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP).
We prove an upper bound of $\widetilde{O}(H\varepsilon^{-3}\ln\frac{1}{\delta})$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of the bias of any optimal policy, $\varepsilon$ is the accuracy, and $\delta$ is the failure probability.
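Since the bound is stated per state-action pair, it aggregates in the obvious way; as a quick reading (our arithmetic, writing $S$ and $A$ for the numbers of states and actions, symbols the summary itself does not name):
\[
  S A \cdot \widetilde{O}\Big(\frac{H}{\varepsilon^{3}}\ln\frac{1}{\delta}\Big)
  \;=\; \widetilde{O}\Big(\frac{S A H}{\varepsilon^{3}}\ln\frac{1}{\delta}\Big)
\]
total samples suffice for an $\varepsilon$-optimal policy.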
arXiv Detail & Related papers (2022-12-01T15:57:58Z) - When Privacy Meets Partial Information: A Refined Analysis of
Differentially Private Bandits [4.964737844687583]
We study the problem of multi-armed bandits with $\epsilon$-global Differential Privacy (DP).
We instantiate $\epsilon$-global DP extensions of the UCB and KL-UCB algorithms, namely AdaP-UCB and AdaP-KLUCB.
AdaP-KLUCB is the first algorithm that both satisfies $\epsilon$-global DP and yields a regret upper bound that matches the problem-dependent lower bound up to multiplicative constants.
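As a rough illustration of what an $\epsilon$-global-DP extension of UCB involves (a generic textbook-style recipe sketched under our own assumptions, not the AdaP-UCB mechanism itself), one can release each arm's reward sum through a Laplace mechanism only at doubling points of its pull count, so a single reward participates in at most $\lceil\log_2 T\rceil$ releases:

```python
import numpy as np

rng = np.random.default_rng(1)

# Generic sketch, NOT AdaP-UCB: UCB run on privatized per-arm reward sums.
# Each arm's noisy sum is re-released only when its pull count doubles, so a
# single reward enters at most ceil(log2 T) releases; giving each release
# budget eps / ceil(log2 T) makes the whole run eps-global-DP by composition.

n_arms, eps, T = 5, 1.0, 50_000              # illustrative parameters
mu = rng.uniform(0.2, 0.8, size=n_arms)      # unknown Bernoulli means
releases = int(np.ceil(np.log2(T)))          # max releases per arm
scale = releases / eps                       # Laplace scale per release

counts = np.zeros(n_arms)
true_sums = np.zeros(n_arms)                 # kept private, never shown to the index
noisy_sums = np.zeros(n_arms)                # what the UCB index actually sees

for t in range(1, T + 1):
    mean_hat = noisy_sums / np.maximum(counts, 1)
    # Heuristic width: the usual UCB bonus inflated by the Laplace scale.
    width = (1 + scale) * np.sqrt(2 * np.log(t + 1) / np.maximum(counts, 1))
    index = np.where(counts > 0, mean_hat + width, np.inf)
    j = int(np.argmax(index))

    true_sums[j] += rng.binomial(1, mu[j])   # reward stays on the private side
    counts[j] += 1
    if np.log2(counts[j]).is_integer():      # doubling point: fresh noisy release
        noisy_sums[j] = true_sums[j] + rng.laplace(scale=scale)

print("pulls per arm:", counts.astype(int), "| best arm:", int(np.argmax(mu)))
```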
arXiv Detail & Related papers (2022-09-06T15:26:24Z) - Cascading Bandit under Differential Privacy [21.936577816668944]
We study differential privacy (DP) and local differential privacy (LDP) in cascading bandits.
Under DP, we propose an algorithm which guarantees $\epsilon$-indistinguishability and a regret of $\mathcal{O}((\frac{\log T}{\epsilon})^{1+\xi})$ for an arbitrarily small $\xi$, improving on the previous $\mathcal{O}(\frac{\log^3 T}{\epsilon})$ regret.
Under $(\epsilon,\delta)$-LDP, we relax the $K^2$ dependence through the tradeoff between privacy budget $\epsilon$ and error probability $\delta$.
arXiv Detail & Related papers (2021-05-24T07:19:01Z) - Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes with $S$ states, $A$ actions and planning horizon $H$.
We obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs.
arXiv Detail & Related papers (2021-03-25T18:52:17Z) - Bandits with many optimal arms [68.17472536610859]
We write $p^*$ for the proportion of optimal arms and $\Delta$ for the minimal mean-gap between optimal and sub-optimal arms.
We characterize the optimal learning rates both in the cumulative regret setting, and in the best-arm identification setting.
arXiv Detail & Related papers (2021-03-23T11:02:31Z) - Private Stochastic Convex Optimization: Optimal Rates in $\ell_1$
Geometry [69.24618367447101]
Up to logarithmic factors, the optimal excess population loss of any $(\varepsilon,\delta)$-differentially private optimizer is $\sqrt{\log(d)/n} + \sqrt{d}/(\varepsilon n)$.
We show that when the loss functions satisfy additional smoothness assumptions, the excess loss is upper bounded (up to logarithmic factors) by $\sqrt{\log(d)/n} + (\log(d)/(\varepsilon n))^{2/3}$.
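A quick back-of-the-envelope comparison (our algebra, with logarithmic factors dropped) shows when the smooth-case private term improves on the general one:
\[
  \Big(\frac{\log d}{\varepsilon n}\Big)^{2/3} \le \frac{\sqrt{d}}{\varepsilon n}
  \;\Longleftrightarrow\; (\log d)^{2/3}(\varepsilon n)^{1/3} \le \sqrt{d}
  \;\Longleftrightarrow\; \varepsilon n \le \frac{d^{3/2}}{\log^2 d},
\]
i.e., smoothness helps precisely in the high-dimensional or moderately private regime.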
arXiv Detail & Related papers (2021-03-02T06:53:44Z) - Near-Optimal Algorithms for Differentially Private Online Learning in a Stochastic Environment [7.4288915613206505]
We study differentially private online learning problems in a stochastic environment under both bandit and full information feedback.
For differentially private bandits, we propose both UCB- and Thompson-Sampling-based algorithms that are anytime and achieve the optimal $O\left(\sum_{j:\Delta_j>0} \frac{\ln(T)}{\min\{\Delta_j, \epsilon\}}\right)$ regret, matching the minimax lower bound.
For the same differentially private full information setting, we also present an $\epsilon$-differentially private algorithm.
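Unpacking the per-arm term in the bandit bound above is simple algebra, not an additional claim from the paper:
\[
  \frac{\ln(T)}{\min\{\Delta_j, \epsilon\}}
  \;=\; \max\Big\{\frac{\ln(T)}{\Delta_j},\; \frac{\ln(T)}{\epsilon}\Big\},
\]
so each suboptimal arm pays the larger of the usual non-private exploration cost and a privacy cost.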
arXiv Detail & Related papers (2021-02-16T02:48:16Z) - Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits
with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z) - Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs [99.59319332864129]
We show that UCBVI-$\gamma$ achieves an $\tilde{O}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor and $T$ is the number of steps.
In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tilde{\Omega}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$.
arXiv Detail & Related papers (2020-10-01T17:57:47Z)