Locally Differentially Private (Contextual) Bandits Learning
- URL: http://arxiv.org/abs/2006.00701v4
- Date: Fri, 15 Jan 2021 07:03:03 GMT
- Title: Locally Differentially Private (Contextual) Bandits Learning
- Authors: Kai Zheng, Tianle Cai, Weiran Huang, Zhenguo Li, Liwei Wang
- Abstract summary: We study locally differentially private (LDP) bandits learning in this paper.
We propose simple black-box reduction frameworks that can solve a large family of context-free bandits learning problems with LDP guarantee.
- Score: 55.63825598391525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study locally differentially private (LDP) bandits learning in this paper.
First, we propose simple black-box reduction frameworks that can solve a large
family of context-free bandits learning problems with LDP guarantee. Based on
our frameworks, we can improve previous best results for private bandits
learning with one-point feedback, such as private Bandits Convex Optimization,
and obtain the first result for Bandits Convex Optimization (BCO) with
multi-point feedback under LDP. LDP guarantee and black-box nature make our
frameworks more attractive in real applications compared with previous
specifically designed and relatively weaker differentially private (DP)
context-free bandits algorithms. Further, we extend our $(\varepsilon,
\delta)$-LDP algorithm to Generalized Linear Bandits, which enjoys a sub-linear
regret $\tilde{O}(T^{3/4}/\varepsilon)$ and is conjectured to be nearly
optimal. Note that given the existing $\Omega(T)$ lower bound for DP contextual
linear bandits (Shariff & Sheffet, 2018), our result shows a fundamental
difference between LDP and DP contextual bandits learning.
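To make the black-box reduction concrete, here is a minimal sketch of the general idea behind LDP context-free bandits: each user perturbs their own one-point feedback with Laplace noise calibrated to the reward range before any learner sees it, and an off-the-shelf bandit algorithm is then run on the privatized feedback. The names (`privatize_reward`, `EpsilonGreedyBandit`), the epsilon-greedy base learner, and the noise calibration are illustrative assumptions of this sketch, not the paper's actual frameworks or analyses.

```python
import numpy as np


def privatize_reward(reward: float, epsilon: float, reward_range: float = 1.0) -> float:
    """Release one bounded reward under epsilon-LDP via the Laplace mechanism.

    The noise scale reward_range / epsilon is the standard Laplace calibration for a
    value whose range (max - min) is reward_range; this choice is illustrative and
    not necessarily the calibration used in the paper.
    """
    return reward + np.random.laplace(loc=0.0, scale=reward_range / epsilon)


class EpsilonGreedyBandit:
    """A deliberately simple context-free bandit learner used as the black box."""

    def __init__(self, n_arms: int, explore_prob: float = 0.1):
        self.counts = np.zeros(n_arms)
        self.value_estimates = np.zeros(n_arms)
        self.explore_prob = explore_prob

    def select_arm(self) -> int:
        if np.random.rand() < self.explore_prob:
            return int(np.random.randint(len(self.counts)))
        return int(np.argmax(self.value_estimates))

    def update(self, arm: int, noisy_reward: float) -> None:
        # The learner only ever observes the privatized feedback.
        self.counts[arm] += 1
        self.value_estimates[arm] += (noisy_reward - self.value_estimates[arm]) / self.counts[arm]


# Toy interaction loop: each round, the user privatizes their own feedback
# locally, so the server-side learner never sees raw rewards.
true_means = np.array([0.2, 0.5, 0.8])
learner, epsilon = EpsilonGreedyBandit(n_arms=3), 1.0
for t in range(10_000):
    arm = learner.select_arm()
    raw_reward = float(np.random.rand() < true_means[arm])   # user's private feedback in {0, 1}
    learner.update(arm, privatize_reward(raw_reward, epsilon))
```

Because the Laplace noise only inflates the variance of the observed feedback, the black-box learner still identifies good arms, at the cost of an extra dependence on $1/\varepsilon$ in the regret, consistent with the $\tilde{O}(T^{3/4}/\varepsilon)$ bound quoted above.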
Related papers
- Indexed Minimum Empirical Divergence-Based Algorithms for Linear Bandits [55.938644481736446]
Indexed Minimum Empirical Divergence (IMED) is a highly effective approach to the multi-armed bandit problem.
It has been observed to empirically outperform UCB-based algorithms and Thompson Sampling.
We present novel linear versions of the IMED algorithm, which we call the family of LinIMED algorithms.
arXiv Detail & Related papers (2024-05-24T04:11:58Z) - Federated Linear Contextual Bandits with User-level Differential Privacy [8.396384861175799]
This paper studies federated linear contextual bandits under the notion of user-level differential privacy (DP).
We first introduce a unified federated bandits framework that can accommodate various definitions of DP in the sequential decision-making setting.
We then formally introduce user-level central DP (CDP) and local DP (LDP) in the federated bandits framework, and investigate the fundamental trade-offs between the learning regrets and the corresponding DP guarantees.
arXiv Detail & Related papers (2023-06-08T15:21:47Z) - Differentially Private Episodic Reinforcement Learning with Heavy-tailed
Rewards [12.809396600279479]
We study the problem of Markov decision processes (MDPs) with heavy-tailed rewards under the constraint of differential privacy (DP).
By resorting to robust mean estimators for rewards, we first propose two frameworks for heavy-tailed MDPs.
We show that there are fundamental differences between private RL with sub-Gaussian rewards and private RL with heavy-tailed rewards.
arXiv Detail & Related papers (2023-06-01T20:18:39Z) - When Privacy Meets Partial Information: A Refined Analysis of
Differentially Private Bandits [4.964737844687583]
We study the problem of multi-armed bandits with $\epsilon$-global Differential Privacy (DP).
We instantiate $\epsilon$-global DP extensions of UCB and KL-UCB algorithms, namely AdaP-UCB and AdaP-KLUCB.
AdaP-KLUCB is the first algorithm that both satisfies $\epsilon$-global DP and yields a regret upper bound that matches the problem-dependent lower bound up to multiplicative constants.
arXiv Detail & Related papers (2022-09-06T15:26:24Z) - Generalized Linear Bandits with Local Differential Privacy [4.922800530841394]
Many applications such as personalized medicine and online advertising require the utilization of individual-specific information for effective learning.
This motivates the introduction of local differential privacy (LDP), a stringent notion in privacy, to contextual bandits.
In this paper, we design LDP algorithms for generalized linear bandits that achieve the same regret bound as in the non-private setting.
arXiv Detail & Related papers (2021-06-07T06:42:00Z) - Optimal Rates of (Locally) Differentially Private Heavy-tailed
Multi-Armed Bandits [11.419534203345187]
We study the problem of multi-armed bandits (MAB) in the (local) differential privacy (DP/LDP) model.
We propose an algorithm which could be seen as a locally private and robust version of the Successive Elimination (SE) algorithm.
arXiv Detail & Related papers (2021-06-04T16:17:21Z) - Towards Minimax Optimal Best Arm Identification in Linear Bandits [95.22854522340938]
We study the problem of best arm identification in linear bandits in the fixed-budget setting.
By leveraging properties of the G-optimal design and incorporating it into the arm allocation rule, we design a parameter-free algorithm, OD-LinBAI.
We provide a theoretical analysis of the failure probability of OD-LinBAI.
arXiv Detail & Related papers (2021-05-27T09:19:10Z) - Upper Confidence Bounds for Combining Stochastic Bandits [52.10197476419621]
We provide a simple method to combine bandit algorithms.
Our approach is based on a "meta-UCB" procedure that treats each of $N$ individual bandit algorithms as arms in a higher-level $N$-armed bandit problem; a minimal sketch of this idea appears after this list.
arXiv Detail & Related papers (2020-12-24T05:36:29Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies whose expected cumulative reward is maximized while each policy's expected cost stays below a given threshold.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Optimistic Policy Optimization with Bandit Feedback [70.75568142146493]
We propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde{O}(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards.
To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
arXiv Detail & Related papers (2020-02-19T15:41:18Z)
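As referenced in the "Upper Confidence Bounds for Combining Stochastic Bandits" entry above, the sketch below illustrates the general idea of a meta-UCB procedure: each base bandit algorithm is treated as an arm of a higher-level $N$-armed UCB instance. The base-algorithm interface (`select_arm()` / `update(arm, reward)`), the UCB1-style index, and updating only the chosen base learner are assumptions of this sketch, not the exact procedure analyzed in that paper.

```python
import math


class MetaUCB:
    """Treat N base bandit algorithms as arms of a higher-level N-armed UCB problem."""

    def __init__(self, base_algorithms):
        self.base = base_algorithms
        self.counts = [0] * len(base_algorithms)   # rounds in which each base algorithm was chosen
        self.means = [0.0] * len(base_algorithms)  # average reward earned by each base algorithm
        self.t = 0

    def _index(self, i: int) -> float:
        if self.counts[i] == 0:
            return float("inf")  # try every base algorithm at least once
        return self.means[i] + math.sqrt(2.0 * math.log(self.t) / self.counts[i])

    def step(self, pull_arm) -> float:
        """One round: pick a base algorithm by its UCB index, let it act, record the reward."""
        self.t += 1
        i = max(range(len(self.base)), key=self._index)
        arm = self.base[i].select_arm()
        reward = pull_arm(arm)              # environment feedback for the chosen arm
        self.base[i].update(arm, reward)    # only the selected base learner is updated
        self.counts[i] += 1
        self.means[i] += (reward - self.means[i]) / self.counts[i]
        return reward
```

For example, running MetaUCB over several copies of the EpsilonGreedyBandit from the earlier sketch, each with a different exploration rate, would gradually concentrate its plays on the best-performing configuration.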