Related papers: Exploration by Optimization with Hybrid Regularizers: Logarithmic Regret with Adversarial Robustness in Partial Monitoring

Exploration by Optimization with Hybrid Regularizers: Logarithmic Regret with Adversarial Robustness in Partial Monitoring

URL: http://arxiv.org/abs/2402.08321v2
Date: Sun, 16 Feb 2025 19:23:06 GMT
Title: Exploration by Optimization with Hybrid Regularizers: Logarithmic Regret with Adversarial Robustness in Partial Monitoring
Authors: Taira Tsuchiya, Shinji Ito, Junya Honda,
Abstract summary: Partial monitoring is a generic framework of online decision-making problems with limited feedback.<n>We present a new framework and analysis for ExO with a hybrid regularizer.<n>In particular, we derive a regret bound of $O(sum_a neq a*)$, where $k$, $m$, and $T$ are the numbers of actions, observations and rounds.
Score: 41.20252349191698
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Partial monitoring is a generic framework of online decision-making problems with limited feedback. To make decisions from such limited feedback, it is necessary to find an appropriate distribution for exploration. Recently, a powerful approach for this purpose, \emph{exploration by optimization} (ExO), was proposed, which achieves optimal bounds in adversarial environments with follow-the-regularized-leader for a wide range of online decision-making problems. However, a naive application of ExO in stochastic environments significantly degrades regret bounds. To resolve this issue in locally observable games, we first establish a new framework and analysis for ExO with a hybrid regularizer. This development allows us to significantly improve existing regret bounds of best-of-both-worlds (BOBW) algorithms, which achieves nearly optimal bounds both in stochastic and adversarial environments. In particular, we derive a stochastic regret bound of $O(\sum_{a \neq a^*} k^2 m^2 \log T / \Delta_a)$, where $k$, $m$, and $T$ are the numbers of actions, observations and rounds, $a^*$ is an optimal action, and $\Delta_a$ is the suboptimality gap for action $a$. This bound is roughly $\Theta(k^2 \log T)$ times smaller than existing BOBW bounds. In addition, for globally observable games, we provide a new BOBW algorithm with the first $O(\log T)$ stochastic bound.

Related papers

Improved Best-of-Both-Worlds Regret for Bandits with Delayed Feedback [40.86836752251116]
We study the multi-armed bandit problem with adversarially chosen delays in the Best-Both-Worlds (BoBW) framework.<n>Our main contribution is a new algorithm that matches the known lower bounds in each setting individually.<n>We provide a regret bound which scale as $sum_i>0left(log T/Delta_iright) + frac1Ksum Delta_i sigma_max$, where $Delta_i$ is the sub-optimality gap of arm $i$ and $
arXiv Detail & Related papers (2025-05-30T04:05:52Z)
Bayesian Optimization from Human Feedback: Near-Optimal Regret Bounds [20.024434010891433]
We refer to this problem as Bayesian Optimization from Human Feedback (HF), which differs from the best actions from a reduced feedback model.<n>The objective is to identify the best action using a limited preference framework.<n>In other words, the same number of preferential emerging samples as scalar-valued samples is sufficient to find a nearly optimal solution.
arXiv Detail & Related papers (2025-05-29T17:17:29Z)
Beyond Minimax Rates in Group Distributionally Robust Optimization via a Novel Notion of Sparsity [14.396304498754688]
We show a novel notion of sparsity that we dub $(lambda, beta)$-sparsity. In short, there is a set of at most $beta$ groups whose risks at $theta$ all are at least $lambda$ larger than the risks of the other groups. We show how to get a dimension-free semi-adaptive sample complexity bound with a computationally efficient method.
arXiv Detail & Related papers (2024-10-01T13:45:55Z)
Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Shortest Path (SSP) problem with a linear mixture transition kernel. An agent repeatedly interacts with a environment and seeks to reach certain goal state while minimizing the cumulative cost. Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
Best-of-Both-Worlds Algorithms for Linear Contextual Bandits [11.94312915280916]
We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits. Our algorithms deliver near-optimal regret bounds in both the adversarial and adversarial regimes.
arXiv Detail & Related papers (2023-12-24T08:27:30Z)
Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novelgreedy bandit (SGB) algorithm for multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $tin [T]$ is observed. SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z)
Best-of-Three-Worlds Linear Bandit Algorithm with Variance-Adaptive Regret Bounds [27.92755687977196]
The paper proposes a linear bandit algorithm that is adaptive to environments at two different levels of hierarchy. At the higher level, the proposed algorithm adapts to a variety of types of environments. The proposed algorithm has data-dependent regret bounds that depend on all of the cumulative loss for the optimal action.
arXiv Detail & Related papers (2023-02-24T00:02:24Z)
Contextual Decision-Making with Knapsacks Beyond the Worst Case [5.65888994172721]
We study the framework of a dynamic decision-making scenario with resource constraints. In this framework, an agent selects an action in each round upon observing a random request. We prove that our algorithm maintains a near-optimal $widetildeO(sqrtT)$ regret even in the worst cases.
arXiv Detail & Related papers (2022-11-25T08:21:50Z)
Adversarially Robust Multi-Armed Bandit Algorithm with Variance-Dependent Regret Bounds [34.37963000493442]
This paper considers the multi-armed bandit (MAB) problem and provides a new best-of-both-worlds (BOBW) algorithm that works nearly optimally in both adversarial settings.
arXiv Detail & Related papers (2022-06-14T12:58:46Z)
Towards Painless Policy Optimization for Constrained MDPs [46.12526917024248]
We study policy optimization in an infinite horizon, $gamma$-discounted constrained Markov decision process (CMDP) Our objective is to return a policy that achieves large expected reward with a small constraint violation. We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms.
arXiv Detail & Related papers (2022-04-11T15:08:09Z)
Bayesian Optimistic Optimisation with Exponentially Decaying Regret [58.02542541410322]
The current practical BO algorithms have regret bounds ranging from $mathcalO(fraclogNsqrtN)$ to $mathcal O(e-sqrtN)$, where $N$ is the number of evaluations. This paper explores the possibility of improving the regret bound in the noiseless setting by intertwining concepts from BO and tree-based optimistic optimisation. We propose the BOO algorithm, a first practical approach which can achieve an exponential regret bound with order $mathcal O(N-sqrt
arXiv Detail & Related papers (2021-05-10T13:07:44Z)
Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap [84.66885506098724]
This paper presents a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDP), Adaptive Multi-step Bootstrap (AMB) We show AMB achieves a gap-dependent regret bound that only scales with the sum of the inverse of the sub-optimality gaps. We also show AMB suffers an additional $frac|Z_mul|Delta_min$ regret, where $Z_mul$ is the set of state-action pairs $(s,a)$'s satisfying $a$ is a non-unique optimal action for
arXiv Detail & Related papers (2021-02-09T07:46:34Z)
Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$2$UCB algorithm has the optimal regret bound $tildeO(dsqrtkT + dk)$ for the partition matroid constraints. For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.