Related papers: Near-Optimal Regret for Efficient Stochastic Combinatorial Semi-Bandits

Near-Optimal Regret for Efficient Stochastic Combinatorial Semi-Bandits

URL: http://arxiv.org/abs/2508.06247v1
Date: Fri, 08 Aug 2025 12:01:50 GMT
Title: Near-Optimal Regret for Efficient Stochastic Combinatorial Semi-Bandits
Authors: Zichun Ye, Runqi Wang, Xutong Liu, Shuai Li,
Abstract summary: Minimax Optimal Strategy (CMOSS) achieves an instance-independent regret of $Obig(log k)2sqrtkmTbig )$ under semi-bandit feedback.<n>We show that CMOSS consistently outperforms benchmark algorithms in both regret runtime and efficiency.
Score: 12.655523567240042
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The combinatorial multi-armed bandit (CMAB) is a cornerstone of sequential decision-making framework, dominated by two algorithmic families: UCB-based and adversarial methods such as follow the regularized leader (FTRL) and online mirror descent (OMD). However, prominent UCB-based approaches like CUCB suffer from additional regret factor $\log T$ that is detrimental over long horizons, while adversarial methods such as EXP3.M and HYBRID impose significant computational overhead. To resolve this trade-off, we introduce the Combinatorial Minimax Optimal Strategy in the Stochastic setting (CMOSS). CMOSS is a computationally efficient algorithm that achieves an instance-independent regret of $O\big( (\log k)^2\sqrt{kmT}\big )$ under semi-bandit feedback, where $m$ is the number of arms and $k$ is the maximum cardinality of a feasible action. Crucially, this result eliminates the dependency on $\log T$ and matches the established $\Omega\big( \sqrt{kmT}\big)$ lower bound up to $O\big((\log k)^2\big)$. We then extend our analysis to show that CMOSS is also applicable to cascading feedback. Experiments on synthetic and real-world datasets validate that CMOSS consistently outperforms benchmark algorithms in both regret and runtime efficiency.

Related papers

Multi-Play Combinatorial Semi-Bandit Problem [8.922365714546162]
In the semi-bandit (CSB) problem, a player selects an action from a action set and observes feedback from the base arms included in the action.<n>We propose the multi-play semi-bandit (MP-CSB), where a player can select a non-negative integer action and observe multiple feedbacks from a single arm in each round.
arXiv Detail & Related papers (2025-09-12T02:46:59Z)
Near-Optimal Online Learning for Multi-Agent Submodular Coordination: Tight Approximation and Communication Efficiency [52.60557300927007]
We present a $textbfMA-OSMA$ algorithm to transfer the discrete submodular problem into a continuous optimization.<n>We also introduce a projection-free $textbfMA-OSEA$ algorithm, which effectively utilizes the KL divergence by mixing a uniform distribution.<n>Our algorithms significantly improve the $(frac11+c)$-approximation provided by the state-of-the-art OSG algorithm.
arXiv Detail & Related papers (2025-02-07T15:57:56Z)
Combinatorial Logistic Bandits [30.829239785016934]
We introduce a novel framework called logistic bandits (CLogB)<n>In each round, a subset of base arms (called the super arm) is selected, with the outcome of each base arm being binary.<n>Experiments on real-world datasets demonstrate the superior performance of our algorithms compared to benchmark algorithms.
arXiv Detail & Related papers (2024-10-22T14:52:46Z)
Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization [11.11876897168701]
We consider the problem of learning in adversarial Markov decision processes with an oblivious adversary.<n>We propose an algorithm, called APO-MVP, that achieves a regret bound of order $tildemathcalO(mathrmpoly(H)sqrtSAT)$.
arXiv Detail & Related papers (2024-07-08T08:06:45Z)
Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Shortest Path (SSP) problem with a linear mixture transition kernel. An agent repeatedly interacts with a environment and seeks to reach certain goal state while minimizing the cumulative cost. Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
Exploration by Optimization with Hybrid Regularizers: Logarithmic Regret with Adversarial Robustness in Partial Monitoring [41.20252349191698]
Partial monitoring is a generic framework of online decision-making problems with limited feedback.<n>We present a new framework and analysis for ExO with a hybrid regularizer.<n>In particular, we derive a regret bound of $O(sum_a neq a*)$, where $k$, $m$, and $T$ are the numbers of actions, observations and rounds.
arXiv Detail & Related papers (2024-02-13T09:34:22Z)
Best-of-Both-Worlds Algorithms for Linear Contextual Bandits [11.94312915280916]
We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits. Our algorithms deliver near-optimal regret bounds in both the adversarial and adversarial regimes.
arXiv Detail & Related papers (2023-12-24T08:27:30Z)
Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novelgreedy bandit (SGB) algorithm for multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $tin [T]$ is observed. SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z)
Federated Linear Bandits with Finite Adversarial Actions [20.1041278044797]
We study a federated linear bandits model, where $M$ clients communicate with a central server to solve a linear contextual bandits problem. To address the unique challenges of adversarial finite action sets, we propose the FedSupLinUCB algorithm. We prove that FedSupLinUCB achieves a total regret of $tildeO(sqrtd T)$, where $T$ is the total number of arm pulls from all clients, and $d$ is the ambient dimension of the linear model.
arXiv Detail & Related papers (2023-11-02T03:41:58Z)
Contextual Combinatorial Bandits with Probabilistically Triggered Arms [55.9237004478033]
We study contextual bandits with probabilistically triggered arms (C$2$MAB-T) under a variety of smoothness conditions. Under the triggering modulated (TPM) condition, we devise the C$2$-UC-T algorithm and derive a regret bound $tildeO(dsqrtT)$.
arXiv Detail & Related papers (2023-03-30T02:51:00Z)
Batch-Size Independent Regret Bounds for Combinatorial Semi-Bandits with Probabilistically Triggered Arms or Independent Arms [59.8188496313214]
We study the semi-bandits (CMAB) and focus on reducing the dependency of the batch-size $K$ in the regret bound. First, for the setting of CMAB with probabilistically triggered arms (CMAB-T), we propose a BCUCB-T algorithm with variance-aware confidence intervals. Second, for the setting of non-triggering CMAB with independent arms, we propose a SESCB algorithm which leverages on the non-triggering version of the TPVM condition.
arXiv Detail & Related papers (2022-08-31T13:09:39Z)
Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $tilde O(|S||B|(gamma)-3epsilon-2)$ for finding the Nash equilibrium (NE) value up to some $epsilon$ error. We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
FANOK: Knockoffs in Linear Time [73.5154025911318]
We describe a series of algorithms that efficiently implement Gaussian model-X knockoffs to control the false discovery rate on large scale feature selection problems. We test our methods on problems with $p$ as large as $500,000$.
arXiv Detail & Related papers (2020-06-15T21:55:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.