High Probability Bound for Cross-Learning Contextual Bandits with Unknown Context Distributions
- URL: http://arxiv.org/abs/2410.04080v1
- Date: Sat, 5 Oct 2024 08:19:54 GMT
- Title: High Probability Bound for Cross-Learning Contextual Bandits with Unknown Context Distributions
- Authors: Ruiyuan Huang, Zengfeng Huang
- Abstract summary: We study the problem of contextual bandits with cross learning, where the learner observes the loss associated with the action across all possible contexts.
Our focus is on a setting where losses are chosen adversarially, and contexts are sampled i.i.d. from a specific distribution.
Schneider and Zimmert (2023) recently proposed an algorithm that achieves nearly optimal expected regret.
We present a novel, in-depth analysis of their algorithm and demonstrate that it actually achieves near-optimal regret with high probability.
- Score: 17.43748116766233
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Motivated by applications in online bidding and sleeping bandits, we examine the problem of contextual bandits with cross learning, where the learner observes the loss associated with the action across all possible contexts, not just the current round's context. Our focus is on a setting where losses are chosen adversarially, and contexts are sampled i.i.d. from a specific distribution. This problem was first studied by Balseiro et al. (2019), who proposed an algorithm that achieves near-optimal regret under the assumption that the context distribution is known in advance. However, this assumption is often unrealistic. To address this issue, Schneider and Zimmert (2023) recently proposed a new algorithm that achieves nearly optimal expected regret. It is well-known that expected regret can be significantly weaker than high-probability bounds. In this paper, we present a novel, in-depth analysis of their algorithm and demonstrate that it actually achieves near-optimal regret with high probability. There are steps in the original analysis by Schneider and Zimmert (2023) that lead only to an expected bound by nature. In our analysis, we introduce several new insights. Specifically, we make extensive use of the weak dependency structure between different epochs, which was overlooked in previous analyses. Additionally, standard martingale inequalities are not directly applicable, so we refine martingale inequalities to complete our analysis.
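To make the cross-learning setting concrete, below is a minimal sketch of an EXP3-style learner with cross-learning updates, written for the simplified case where the context distribution is known (the setting of Balseiro et al. (2019)). The function name, the loss-estimator details, and the learning-rate handling are illustrative assumptions, not the algorithm of Schneider and Zimmert (2023), which additionally estimates the unknown context distribution from contexts observed in earlier epochs.
```python
import numpy as np

def cross_learning_exp3(losses, contexts, mu, eta, seed=0):
    """Illustrative cross-learning EXP3-style learner (known context distribution).

    losses:   (T, K, C) array; losses[t, a, c] is the adversarial loss of arm a
              under context c at round t (revealed for the played arm only).
    contexts: (T,) array of context indices, sampled i.i.d. from mu.
    mu:       (C,) context distribution, assumed known here for simplicity.
    eta:      learning rate.
    """
    T, K, C = losses.shape
    cum_est = np.zeros((K, C))          # cumulative importance-weighted loss estimates
    rng = np.random.default_rng(seed)
    total_loss = 0.0

    for t in range(T):
        c = contexts[t]
        # Exponential weights, maintained separately for every context (stabilized).
        probs = np.exp(-eta * (cum_est - cum_est.min(axis=0)))   # shape (K, C)
        probs /= probs.sum(axis=0, keepdims=True)
        a = rng.choice(K, p=probs[:, c])
        total_loss += losses[t, a, c]

        # Cross-learning update: the played arm's loss is observed for *all*
        # contexts, so every context's estimate is updated, importance-weighted
        # by the marginal probability of playing arm a under mu.
        marginal = float(probs[a, :] @ mu)
        cum_est[a, :] += losses[t, a, :] / max(marginal, 1e-12)

    return total_loss
```
The point the sketch highlights is that a single played arm yields a loss estimate for every context at once, which is why the regret achievable in this setting can be made independent of the number of contexts.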
Related papers
- Sequential Probability Assignment with Contexts: Minimax Regret, Contextual Shtarkov Sums, and Contextual Normalized Maximum Likelihood [28.94461817548213]
We introduce a novel complexity measure, the contextual Shtarkov sum, corresponding to the Shtarkov sum after projection onto a multiary context tree.
We derive the minimax optimal strategy, dubbed contextual Normalized Maximum Likelihood (cNML).
Our results hold for sequential experts, beyond binary labels, which are settings rarely considered in prior work.
arXiv Detail & Related papers (2024-10-04T18:31:12Z) - Causal Bandits: The Pareto Optimal Frontier of Adaptivity, a Reduction to Linear Bandits, and Limitations around Unknown Marginals [28.94461817548213]
We prove upper and matching lower bounds on the possible trade-offs in the performance of learning in conditionally benign and arbitrary environments.
We are the first to obtain instance-dependent bounds for causal bandits, by reducing the problem to the linear bandit setting.
arXiv Detail & Related papers (2024-07-01T04:12:15Z) - Thompson Sampling in Partially Observable Contextual Bandits [2.465689259704613]
We study bandit policies for learning to select optimal arms based on observed data.
Our theoretical analysis shows that the Thompson sampling policy successfully balances exploration and exploitation.
These techniques pave the road towards studying other decision-making problems with contextual information as well as partial observations.
arXiv Detail & Related papers (2024-02-15T19:37:39Z) - Optimal cross-learning for contextual bandits with unknown context distributions [28.087360479901978]
We consider the problem of designing contextual bandit algorithms in the "cross-learning" setting of Balseiro et al.
We provide an efficient algorithm with a nearly tight (up to logarithmic factors) regret bound of $\widetilde{O}(\sqrt{TK})$, independent of the number of contexts.
At the core of our algorithm is a novel technique for coordinating the execution of an algorithm over multiple epochs.
arXiv Detail & Related papers (2024-01-03T18:02:13Z) - Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation that in scenarios where the comparison is deterministic, the algorithm only suffers from an $\tilde{O}(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z) - Contexts can be Cheap: Solving Stochastic Contextual Bandits with Linear Bandit Algorithms [39.70492757288025]
We address the contextual linear bandit problem, where a decision maker is provided a context.
We show that the contextual problem can be solved as a linear bandit problem.
Our results imply an $O(d\sqrt{T\log T})$ high-probability regret bound for contextual linear bandits.
arXiv Detail & Related papers (2022-11-08T22:18:53Z) - A Unifying Theory of Thompson Sampling for Continuous Risk-Averse Bandits [91.3755431537592]
This paper unifies the analysis of risk-averse Thompson sampling algorithms for the multi-armed bandit problem.
Using the contraction principle in the theory of large deviations, we prove novel concentration bounds for continuous risk functionals.
We show that a wide class of risk functionals as well as "nice" functions of them satisfy the continuity condition.
arXiv Detail & Related papers (2021-08-25T17:09:01Z) - A Low Rank Promoting Prior for Unsupervised Contrastive Learning [108.91406719395417]
We construct a novel probabilistic graphical model that effectively incorporates the low rank promoting prior into the framework of contrastive learning.
Our hypothesis explicitly requires that all the samples belonging to the same instance class lie on the same subspace with small dimension.
Empirical evidence shows that the proposed algorithm clearly surpasses the state-of-the-art approaches on multiple benchmarks.
arXiv Detail & Related papers (2021-08-05T15:58:25Z) - Efficient First-Order Contextual Bandits: Prediction, Allocation, and Triangular Discrimination [82.52105963476703]
A recurring theme in statistical learning, online learning, and beyond is that faster convergence rates are possible for problems with low noise.
First-order guarantees are relatively well understood in statistical and online learning.
We show that the logarithmic loss and an information-theoretic quantity called the triangular discrimination play a fundamental role in obtaining first-order guarantees.
arXiv Detail & Related papers (2021-07-05T19:20:34Z) - Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes.
The agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z) - Latent Bandits Revisited [55.88616813182679]
A latent bandit problem is one in which the learning agent knows the arm reward distributions conditioned on an unknown discrete latent state.
We propose general algorithms for this setting, based on both upper confidence bounds (UCBs) and Thompson sampling.
We provide a unified theoretical analysis of our algorithms, which have lower regret than classic bandit policies when the number of latent states is smaller than the number of actions.
arXiv Detail & Related papers (2020-06-15T19:24:02Z)