Self-Concordant Perturbations for Linear Bandits
- URL: http://arxiv.org/abs/2510.24187v1
- Date: Tue, 28 Oct 2025 08:47:15 GMT
- Title: Self-Concordant Perturbations for Linear Bandits
- Authors: Lucas Lévy, Jean-Lou Valeau, Arya Akhavan, Patrick Rebeschini,
- Abstract summary: We present a unified algorithmic framework that bridges Follow-the-Regularized-Leader and Follow-the-Perturbed-Leader methods.<n>We introduce self-concordant perturbations, a family of probability distributions that mirror the role of self-concordant barriers.<n>Our approach achieves a regret of $O(dsqrtn ln)$ on both the $d$-dimensional hypercube and the Euclidean ball.
- Score: 9.957131269346096
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the adversarial linear bandits problem and present a unified algorithmic framework that bridges Follow-the-Regularized-Leader (FTRL) and Follow-the-Perturbed-Leader (FTPL) methods, extending the known connection between them from the full-information setting. Within this framework, we introduce self-concordant perturbations, a family of probability distributions that mirror the role of self-concordant barriers previously employed in the FTRL-based SCRiBLe algorithm. Using this idea, we design a novel FTPL-based algorithm that combines self-concordant regularization with efficient stochastic exploration. Our approach achieves a regret of $O(d\sqrt{n \ln n})$ on both the $d$-dimensional hypercube and the Euclidean ball. On the Euclidean ball, this matches the rate attained by existing self-concordant FTRL methods. For the hypercube, this represents a $\sqrt{d}$ improvement over these methods and matches the optimal bound up to logarithmic factors.
Related papers
- Regularized Online RLHF with Generalized Bilinear Preferences [68.44113000390544]
We consider the problem of contextual online RLHF with general preferences.<n>We adopt the Generalized Bilinear Preference Model to capture preferences via low-rank, skew-symmetric matrices.<n>We prove that the dual gap of the greedy policy is bounded by the square of the estimation error.
arXiv Detail & Related papers (2026-02-26T15:27:53Z) - Graph-based Clustering Revisited: A Relaxation of Kernel $k$-Means Perspective [73.18641268511318]
We propose a graph-based clustering algorithm that only relaxes the orthonormal constraint to derive clustering results.<n>To ensure a doubly constraint into a gradient, we transform the non-negative constraint into a class probability parameter.
arXiv Detail & Related papers (2025-09-23T09:14:39Z) - Efficient Best-of-Both-Worlds Algorithms for Contextual Combinatorial Semi-Bandits [3.448177863267093]
We introduce the first best-of-both-worlds algorithm for contextual semi-bandits that simultaneously guarantees $widetildemathcalO(sqrtT)$ regret.<n>By leveraging the Karush-Kuhn-Tucker conditions, we transform the $K$ convex projection problem into a single-dimensional root-finding problem.<n> Empirical evaluations demonstrate that this combined strategy not only attains the attractive regret bounds of best-of-both-worlds algorithms but also delivers substantial per-round speed-ups.
arXiv Detail & Related papers (2025-08-26T07:51:22Z) - Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update [70.38810219913593]
We study the generalized linear bandit (GLB) problem, a contextual multi-armed bandit framework that extends the classical linear model by incorporating a non-linear link function.<n>GLBs are widely applicable to real-world scenarios, but their non-linear nature introduces significant challenges in achieving both computational and statistical efficiency.<n>We propose a jointly efficient algorithm that attains a nearly optimal regret bound with $mathcalO(1)$ time and space complexities per round.
arXiv Detail & Related papers (2025-07-16T02:24:21Z) - Reinforced Latent Reasoning for LLM-based Recommendation [92.56166822197919]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks.<n>Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data.<n>In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning.
arXiv Detail & Related papers (2025-05-25T11:03:45Z) - A Near-optimal, Scalable and Corruption-tolerant Framework for Stochastic Bandits: From Single-Agent to Multi-Agent and Beyond [2.5217803205496283]
We propose a novel framework called BARBAT, which eliminates the factor of $K$ and achieves an optimal regret bound to a logarithmic factor.<n>We also demonstrate how BARBAT can be extended to various settings, including graph bandits, semi-bandits, batched bandits and multi-agent bandits.
arXiv Detail & Related papers (2025-02-11T12:33:33Z) - Optimism in the Face of Ambiguity Principle for Multi-Armed Bandits [6.075593833879357]
Follow-The-Regularized-Leader (FTRL) algorithms often enjoy optimal regret for adversarial as well as bandit problems.<n>We propose a new FTPL algorithm that generates optimal policies for both adversarial and multi-armed bandits.
arXiv Detail & Related papers (2024-09-30T16:00:23Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linearahead as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce emphanti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
arXiv Detail & Related papers (2021-10-21T15:25:15Z) - A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous
Q-Learning and TD-Learning Variants [39.28675942566465]
This paper develops a framework to study finite-sample convergence guarantees of a class of value-based asynchronous RL algorithms.
As a by-product, we provide theoretical insights into the bias-variance trade-off, i.e., efficiency of bootstrapping in RL.
arXiv Detail & Related papers (2021-02-02T15:48:19Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB)
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.