Exploration by Optimization with Hybrid Regularizers: Logarithmic Regret
with Adversarial Robustness in Partial Monitoring
- URL: http://arxiv.org/abs/2402.08321v1
- Date: Tue, 13 Feb 2024 09:34:22 GMT
- Authors: Taira Tsuchiya, Shinji Ito, Junya Honda
- Abstract summary: Exploration by optimization (ExO) was recently proposed and achieves the optimal bounds in adversarial environments.
We first establish a novel framework and analysis for ExO with a hybrid regularizer.
In particular, we derive a stochastic regret bound of $O(\sum_{a \neq a^*} k^2 m^2 \log T / \Delta_a)$, in which $a^*$ is an optimal action and $\Delta_a$ is the suboptimality gap for action $a$.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Partial monitoring is a generic framework of online decision-making problems
with limited observations. To make decisions from such limited observations, it
is necessary to find an appropriate distribution for exploration. Recently, a
powerful approach for this purpose, exploration by optimization (ExO), was
proposed, which achieves the optimal bounds in adversarial environments with
follow-the-regularized-leader for a wide range of online decision-making
problems. However, a naive application of ExO in stochastic environments
significantly degrades regret bounds. To resolve this problem in locally
observable games, we first establish a novel framework and analysis for ExO
with a hybrid regularizer. This development allows us to significantly improve
the existing regret bounds of best-of-both-worlds (BOBW) algorithms, which
achieve nearly optimal bounds in both stochastic and adversarial environments.
In particular, we derive a stochastic regret bound of $O(\sum_{a \neq a^*} k^2
m^2 \log T / \Delta_a)$, where $k$, $m$, and $T$ are the numbers of actions,
observations, and rounds, respectively, $a^*$ is an optimal action, and $\Delta_a$ is the
suboptimality gap for action $a$. This bound is roughly $\Theta(k^2 \log T)$
times smaller than existing BOBW bounds. In addition, for globally observable
games, we provide a new BOBW algorithm with the first $O(\log T)$ stochastic
bound.
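For intuition, below is a minimal sketch of follow-the-regularized-leader over the probability simplex with a hybrid regularizer combining negative Shannon entropy and a log-barrier, the kind of combination the abstract refers to. This is not the paper's algorithm: the parameters `eta` and `gamma` and the use of a generic numerical solver are illustrative assumptions.

```python
# Sketch (not the paper's method): FTRL with a hybrid regularizer
# psi(q) = (1/eta) * sum_a q_a log q_a + gamma * sum_a log(1/q_a),
# i.e., negative entropy plus a log-barrier, minimized over the simplex.
import numpy as np
from scipy.optimize import minimize

def hybrid_regularizer(q, eta, gamma):
    q = np.clip(q, 1e-12, 1.0)  # guard the logs near the boundary
    return np.sum(q * np.log(q)) / eta - gamma * np.sum(np.log(q))

def ftrl_distribution(cum_loss, eta=1.0, gamma=0.1):
    """q_t = argmin_{q in simplex} <q, L_{t-1}> + psi(q)."""
    k = len(cum_loss)
    objective = lambda q: q @ cum_loss + hybrid_regularizer(q, eta, gamma)
    constraints = ({"type": "eq", "fun": lambda q: np.sum(q) - 1.0},)
    bounds = [(1e-9, 1.0)] * k
    result = minimize(objective, np.full(k, 1.0 / k),
                      bounds=bounds, constraints=constraints)
    return result.x

cum_loss = np.array([3.0, 1.0, 2.5, 4.0])
print(ftrl_distribution(cum_loss))  # concentrates on the low-loss action;
                                    # the log-barrier keeps all entries positive
```

The log-barrier term is what keeps every action's probability bounded away from zero, which is the usual role of such a component when combining adversarial robustness with stochastic (logarithmic) guarantees.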
Related papers
- Towards Efficient and Optimal Covariance-Adaptive Algorithms for Combinatorial Semi-Bandits [12.674929126684528]
We address the problem of semi-bandits, where a player selects among $P$ actions from the power set of a set containing $d$ base items.
We show that our approach efficiently leverages the semi-bandit feedback and outperforms bandit feedback approaches.
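As a hypothetical illustration of the two feedback models being compared (not code from the paper): under semi-bandit feedback the player observes the reward of every base item in the chosen subset, while under bandit feedback only the joint reward is revealed.

```python
# Hypothetical illustration: semi-bandit vs. bandit feedback
# for a chosen subset of base items.
import numpy as np

rng = np.random.default_rng(0)
d = 5                                  # number of base items
chosen = np.array([0, 2, 3])           # action: a subset of the d items
item_rewards = rng.normal(size=d)      # latent per-item rewards this round

semi_bandit_feedback = item_rewards[chosen]   # one reward per chosen item
bandit_feedback = item_rewards[chosen].sum()  # only the summed reward

print(semi_bandit_feedback, bandit_feedback)
```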
arXiv Detail & Related papers (2024-02-23T08:07:54Z)
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
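For context, a linear mixture transition kernel is standardly defined via a known feature map combined linearly with an unknown parameter; a sketch of the usual definition follows (standard notation, which may differ from the paper's):

```latex
% Standard linear mixture MDP assumption (sketch): the transition
% kernel is linear in an unknown parameter vector theta^*.
\mathbb{P}(s' \mid s, a) = \langle \phi(s' \mid s, a), \theta^* \rangle,
\qquad \phi \text{ known}, \quad \theta^* \in \mathbb{R}^d \text{ unknown}.
```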
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
- Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel Stochastic-Greedy Bandit (SGB) algorithm for combinatorial multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
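A minimal sketch of the generic stochastic-greedy selection idea such an approach can build on (this is the standard subroutine, not the paper's SGB): at each of $n$ steps, score only a small random sample of the remaining base arms instead of all of them.

```python
# Generic stochastic-greedy subset selection (a sketch, not the paper's
# SGB): evaluate only a random sample of candidates per step, trading
# exactness for per-step cost on large sets of base arms.
import numpy as np

def stochastic_greedy(est_values, n, sample_size, rng):
    """Pick n arms out of len(est_values) base arms using estimated values."""
    chosen, remaining = [], set(range(len(est_values)))
    for _ in range(n):
        sample = rng.choice(sorted(remaining),
                            size=min(sample_size, len(remaining)),
                            replace=False)
        best = int(max(sample, key=lambda a: est_values[a]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

rng = np.random.default_rng(1)
est = rng.uniform(size=100)  # e.g., estimates from an exploration phase
print(stochastic_greedy(est, n=5, sample_size=10, rng=rng))
```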
arXiv Detail & Related papers (2023-12-13T11:08:25Z)
- Best-of-Three-Worlds Linear Bandit Algorithm with Variance-Adaptive Regret Bounds [27.92755687977196]
The paper proposes a linear bandit algorithm that is adaptive to environments at two different levels of hierarchy.
At the higher level, the proposed algorithm adapts to a variety of types of environments.
The proposed algorithm has data-dependent regret bounds that depend on the cumulative loss of the optimal action.
arXiv Detail & Related papers (2023-02-24T00:02:24Z)
- Adversarially Robust Multi-Armed Bandit Algorithm with Variance-Dependent Regret Bounds [34.37963000493442]
This paper considers the multi-armed bandit (MAB) problem and provides a new best-of-both-worlds (BOBW) algorithm that works nearly optimally in both stochastic and adversarial settings.
arXiv Detail & Related papers (2022-06-14T12:58:46Z)
- Towards Painless Policy Optimization for Constrained MDPs [46.12526917024248]
We study policy optimization in an infinite-horizon, $\gamma$-discounted constrained Markov decision process (CMDP).
Our objective is to return a policy that achieves large expected reward with a small constraint violation.
We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms.
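To make the primal-dual idea concrete, such frameworks typically work with the Lagrangian of the constrained problem; a sketch in standard notation follows (generic, not necessarily the paper's formulation):

```latex
% Generic CMDP primal-dual sketch: the constrained problem
%   max_pi  V_r^pi   s.t.  V_c^pi >= b
% is relaxed via the Lagrangian and solved as a saddle point.
\mathcal{L}(\pi, \lambda) = V_r^{\pi} + \lambda \bigl( V_c^{\pi} - b \bigr),
\qquad \max_{\pi} \min_{\lambda \ge 0} \mathcal{L}(\pi, \lambda).
```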
arXiv Detail & Related papers (2022-04-11T15:08:09Z)
- Bayesian Optimistic Optimisation with Exponentially Decaying Regret [58.02542541410322]
The current practical BO algorithms have regret bounds ranging from $\mathcal{O}(\frac{\log N}{\sqrt{N}})$ to $\mathcal{O}(e^{-\sqrt{N}})$, where $N$ is the number of evaluations.
This paper explores the possibility of improving the regret bound in the noiseless setting by intertwining concepts from BO and tree-based optimistic optimisation.
We propose the BOO algorithm, a first practical approach which can achieve an exponential regret bound with order $\mathcal{O}(N^{-\sqrt{N}})$.
arXiv Detail & Related papers (2021-05-10T13:07:44Z)
- Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap [84.66885506098724]
This paper presents a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDPs), Adaptive Multi-step Bootstrap (AMB).
We show AMB achieves a gap-dependent regret bound that only scales with the sum of the inverse of the sub-optimality gaps.
We also show AMB suffers an additional $\frac{|Z_{mul}|}{\Delta_{min}}$ regret, where $Z_{mul}$ is the set of state-action pairs $(s,a)$ such that $a$ is a non-unique optimal action for $s$.
arXiv Detail & Related papers (2021-02-09T07:46:34Z)
- Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$^2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.