Online Nonsubmodular Optimization with Delayed Feedback in the Bandit Setting
- URL: http://arxiv.org/abs/2508.00523v1
- Date: Fri, 01 Aug 2025 11:00:19 GMT
- Title: Online Nonsubmodular Optimization with Delayed Feedback in the Bandit Setting
- Authors: Sifan Yang, Yuanyu Wan, Lijun Zhang
- Abstract summary: We investigate online nonsubmodular optimization with delayed feedback in the bandit setting. We propose a novel method, namely DBGD-NF, which employs the one-point gradient estimator. We also extend DBGD-NF by employing a blocking update mechanism to decouple the joint effect of the delays and bandit feedback.
- Score: 22.208290858529672
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate online nonsubmodular optimization with delayed feedback in the bandit setting, where the loss function is $\alpha$-weakly DR-submodular and $\beta$-weakly DR-supermodular. Previous work has established an $(\alpha,\beta)$-regret bound of $\mathcal{O}(nd^{1/3}T^{2/3})$, where $n$ is the dimensionality and $d$ is the maximum delay. However, its regret bound relies on the maximum delay and is thus sensitive to irregular delays. Additionally, it couples the effects of delays and bandit feedback, as its bound is the product of the delay term and the $\mathcal{O}(nT^{2/3})$ regret bound in the bandit setting without delayed feedback. In this paper, we develop two algorithms to address these limitations, respectively. Firstly, we propose a novel method, namely DBGD-NF, which employs the one-point gradient estimator and utilizes all the available estimated gradients in each round to update the decision. It achieves a better $\mathcal{O}(n\bar{d}^{1/3}T^{2/3})$ regret bound, which depends on the average delay $\bar{d} = \frac{1}{T}\sum_{t=1}^T d_t\leq d$. Secondly, we extend DBGD-NF by employing a blocking update mechanism to decouple the joint effect of the delays and bandit feedback, which enjoys an $\mathcal{O}(n(T^{2/3} + \sqrt{dT}))$ regret bound. When $d = \mathcal{O}(T^{1/3})$, our regret bound matches the $\mathcal{O}(nT^{2/3})$ bound in the bandit setting without delayed feedback. Compared to our first $\mathcal{O}(n\bar{d}^{1/3}T^{2/3})$ bound, it is more advantageous when the maximum delay $d = o(\bar{d}^{2/3}T^{1/3})$. Finally, we conduct experiments on structured sparse learning to demonstrate the superiority of our methods.
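The abstract's key primitive is the one-point gradient estimator: from a single (possibly delayed) loss query at a randomly perturbed point, the learner forms a gradient estimate of a smoothed surrogate of the loss. A minimal sketch of this standard estimator is below; the function name and interface are illustrative and not taken from the paper, and this shows only the estimator, not the full DBGD-NF update.

```python
import numpy as np

def one_point_gradient(f, x, delta, rng):
    """Estimate the gradient of the delta-smoothed surrogate of f at x
    from a single function query at a randomly perturbed point."""
    n = x.shape[0]
    # Sample a uniform direction on the unit sphere.
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    # One bandit-feedback query, scaled by n/delta in the sampled direction.
    return (n / delta) * f(x + delta * u) * u
```

Averaged over many rounds, these single-query estimates concentrate around the gradient of the smoothed loss, which is why the estimator suits the bandit setting where only the loss value (here, after a delay) is observed.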
Related papers
- Reinforcement Learning from Adversarial Preferences in Tabular MDPs [62.73758165845971]
We introduce a new framework of episodic Markov decision processes (MDPs) with adversarial preferences. Unlike standard episodic MDPs with adversarial losses, in PbMDPs the learner instead observes preferences between two candidate arms. We develop algorithms that achieve a regret bound of order $T^{2/3}$ under known transitions.
arXiv Detail & Related papers (2025-07-15T20:19:32Z) - Capacity-Constrained Online Learning with Delays: Scheduling Frameworks and Regret Trade-offs [60.7808741738461]
We study online learning with oblivious delays under a novel "clairvoyance" constraint that limits how many past rounds can be tracked simultaneously for delayed feedback. Our algorithms achieve minimax regrets across all capacity levels, with performance degrading gracefully under suboptimal capacity.
arXiv Detail & Related papers (2025-03-25T17:20:39Z) - Adversarial Combinatorial Bandits with Switching Costs [55.2480439325792]
We study the problem of adversarial bandits with a switching cost $\lambda$ incurred for each switch of the selected arm between rounds.
We derive lower bounds for the minimax regret and design algorithms to approach them.
arXiv Detail & Related papers (2024-04-02T12:15:37Z) - Improved Regret for Bandit Convex Optimization with Delayed Feedback [50.46856739179311]
We study bandit convex optimization (BCO) with delayed feedback, where only the loss value of the action is revealed, after a delay.
We develop a novel algorithm, and prove that it enjoys a regret bound of $O(\sqrt{n}T^{3/4}+\sqrt{dT})$ in general.
We show that the proposed algorithm can improve the regret bound to $O((nT)^{2/3}\log^{1/3}T+d\log T)$ for strongly convex functions.
arXiv Detail & Related papers (2024-02-14T13:08:26Z) - Nearly Minimax Optimal Submodular Maximization with Bandit Feedback [12.28389976959093]
We minimize the learner's regret with respect to an approximation of the maximum $f(S_*)$ with $|S_*| = k$. In this work, we establish the first minimax lower bound for this setting, which scales like $\tilde{\Omega}(\min_{L \le k}(T^{2/3} + \sqrt{\binom{n}{k-L}T}))$. For a slightly restricted algorithm class, we prove a stronger regret lower bound.
arXiv Detail & Related papers (2023-10-27T20:19:03Z) - Borda Regret Minimization for Generalized Linear Dueling Bandits [65.09919504862496]
We study the Borda regret minimization problem for dueling bandits, which aims to identify the item with the highest Borda score.
We propose a rich class of generalized linear dueling bandit models, which cover many existing models.
Our algorithm achieves an $\tilde{O}(d^{2/3} T^{2/3})$ regret, which is also optimal.
arXiv Detail & Related papers (2023-03-15T17:59:27Z) - Smooth Non-Stationary Bandits [49.19728527803684]
We study a non-stationary bandits problem where each arm's mean reward sequence can be embedded into a $\beta$-Hölder function.
We show the first separation between the smooth (i.e., $\beta \ge 2$) and non-smooth (i.e., $\beta = 1$) regimes by presenting a policy with $\tilde{O}(k^{4/5} T^{3/5})$ regret on any $k$-armed, $2$-Hölder instance.
arXiv Detail & Related papers (2023-01-29T06:03:20Z) - A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback [25.68113242132723]
We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multiarmed bandits with delayed feedback.
It simultaneously achieves a near-optimal adversarial regret guarantee in the setting with fixed delays.
We also present an extension of the algorithm to the case of arbitrary delays.
arXiv Detail & Related papers (2022-06-29T20:49:45Z) - Projection-free Online Learning with Arbitrary Delays [38.13351554274417]
We generalize the online Frank-Wolfe (OFW) algorithm and the online smooth projection-free (OSPF) algorithm into a delayed setting.
Our novel analysis shows that under a relatively large amount of delay, the generalized OFW and OSPF enjoy the same regret bound as OFW and OSPF in the non-delayed setting.
arXiv Detail & Related papers (2022-04-11T09:26:24Z) - On Submodular Contextual Bandits [92.45432756301231]
We consider the problem of contextual bandits where actions are subsets of a ground set and mean rewards are modeled by an unknown monotone submodular function.
We show that our algorithm efficiently randomizes around local optima of estimated functions according to the Inverse Gap Weighting strategy.
arXiv Detail & Related papers (2021-12-03T21:42:33Z) - Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays [21.94728545221709]
We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem with unrestricted feedback delays.
SFBanker achieves $\mathcal{O}(\sqrt{K(D+T)}L)\cdot \mathrm{polylog}(T, L)$ total regret, where $T$ is the total number of steps and $D$ is the total feedback delay.
arXiv Detail & Related papers (2021-10-26T04:06:51Z) - Adapting to Delays and Data in Adversarial Multi-Armed Bandits [7.310043452300736]
We analyze variants of the Exp3 algorithm that tune their step-size using only information available at the time of the decisions.
We obtain regret guarantees that adapt to the observed (rather than the worst-case) sequences of delays and/or losses.
arXiv Detail & Related papers (2020-10-12T20:53:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.