Simultaneously Learning Stochastic and Adversarial Bandits under the
Position-Based Model
- URL: http://arxiv.org/abs/2207.05437v1
- Date: Tue, 12 Jul 2022 10:00:14 GMT
- Title: Simultaneously Learning Stochastic and Adversarial Bandits under the
Position-Based Model
- Authors: Cheng Chen, Canzhe Zhao, Shuai Li
- Abstract summary: This work studies the online learning to rank problem in both stochastic and adversarial environments under the position-based model.
We prove the proposed algorithm simultaneously achieves $O(\log T)$ regret in the stochastic environment and $O(m\sqrt{nT})$ regret in the adversarial environment.
Experiments show that our algorithm could simultaneously learn in both stochastic and adversarial environments and is competitive compared to existing methods.
- Score: 9.945948163150874
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online learning to rank (OLTR) interactively learns to choose lists of items
from a large collection based on certain click models that describe users'
click behaviors. Most recent works for this problem focus on the stochastic
environment where the item attractiveness is assumed to be invariant during the
learning process. In many real-world scenarios, however, the environment could
be dynamic or even arbitrarily changing. This work studies the OLTR problem in
both stochastic and adversarial environments under the position-based model
(PBM). We propose a method based on the follow-the-regularized-leader (FTRL)
framework with Tsallis entropy and develop a new self-bounding constraint
especially designed for PBM. We prove the proposed algorithm simultaneously
achieves $O(\log{T})$ regret in the stochastic environment and $O(m\sqrt{nT})$
regret in the adversarial environment, where $T$ is the number of rounds, $n$
is the number of items and $m$ is the number of positions. We also provide a
lower bound of order $\Omega(m\sqrt{nT})$ for adversarial PBM, which matches
our upper bound and improves over the state-of-the-art lower bound. The
experiments show that our algorithm could simultaneously learn in both
stochastic and adversarial environments and is competitive compared to existing
methods that are designed for a single environment.
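As a rough illustration of the FTRL-with-Tsallis-entropy building block mentioned in the abstract, the sketch below implements a Tsallis-INF style update for a single position. It is a minimal sketch under stated assumptions, not the authors' algorithm: the PBM-specific parts of the paper (ranking $m$ items across $m$ positions and the new self-bounding constraint) are omitted, and the helper names (`tsallis_inf_weights`, `run_single_position`, `loss_fn`) and the learning-rate schedule $\eta_t = 1/\sqrt{t}$ are illustrative assumptions.

```python
import numpy as np

def tsallis_inf_weights(cum_loss, eta, iters=60):
    """FTRL step with the 1/2-Tsallis entropy regularizer:
    minimize <p, cum_loss> - (1/eta) * sum_i sqrt(p_i) over the simplex.
    The KKT conditions give p_i = 1 / (2*eta*(cum_loss_i + lam))**2 for a
    multiplier lam chosen by bisection so that the p_i sum to one."""
    probs = lambda lam: 1.0 / (2.0 * eta * (cum_loss + lam)) ** 2
    lo = -cum_loss.min() + 1e-12          # keeps every p_i finite and positive
    hi = lo + 1.0
    while probs(hi).sum() > 1.0:          # grow the bracket until the sum drops below 1
        hi = lo + 2.0 * (hi - lo)
    for _ in range(iters):                # bisection on the multiplier lam
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if probs(mid).sum() > 1.0 else (lo, mid)
    p = probs(hi)
    return p / p.sum()                    # remove residual numerical error

def run_single_position(loss_fn, n_items, T, seed=0):
    """Minimal single-position bandit loop (the paper ranks m items under
    the position-based model; that structure is not reproduced here)."""
    rng = np.random.default_rng(seed)
    cum_loss = np.zeros(n_items)
    for t in range(1, T + 1):
        eta = 1.0 / np.sqrt(t)                   # assumed learning-rate schedule
        p = tsallis_inf_weights(cum_loss, eta)
        item = rng.choice(n_items, p=p)
        loss = loss_fn(t, item)                  # e.g. 1 - click indicator, in [0, 1]
        cum_loss[item] += loss / p[item]         # importance-weighted loss estimate
    return cum_loss
```

A stochastic environment can be simulated by passing a `loss_fn` that draws Bernoulli clicks from fixed attractiveness probabilities, and an adversarial one by letting the losses depend arbitrarily on the round `t`; the same update is used in both cases, which is the best-of-both-worlds behavior the abstract refers to.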
Related papers
- Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs [63.47351876442425]
We study episodic linear mixture MDPs with the unknown transition and adversarial rewards under full-information feedback.
We propose a novel algorithm that combines the benefits of two popular methods: occupancy-measure-based and policy-based.
Our algorithm enjoys an $\widetilde{\mathcal{O}}(d\sqrt{H^3 K} + \sqrt{HK(H + \bar{P}_K)})$ dynamic regret, where $d$ is the feature dimension.
arXiv Detail & Related papers (2024-11-05T13:55:52Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Exploration by Optimization with Hybrid Regularizers: Logarithmic Regret
with Adversarial Robustness in Partial Monitoring [46.30750729936261]
Exploration by optimization (ExO) was previously proposed and achieves the optimal bounds in adversarial environments.
We first establish a novel framework and analysis for ExO with a hybrid regularizer.
In particular, we derive a regret bound of $O(\sum_{a \neq a^*} k^2 \log T / \Delta_a)$, in which $a^*$ is an optimal action and $\Delta_a$ is the suboptimality gap for action $a$.
arXiv Detail & Related papers (2024-02-13T09:34:22Z) - Adversarial Online Multi-Task Reinforcement Learning [12.421997449847153]
We consider the adversarial online multi-task reinforcement learning setting.
In each of $K$ episodes the learner is given an unknown task taken from a finite set of $M$ unknown finite-horizon MDP models.
The learner's objective is to minimize its regret with respect to the optimal policy for each task.
arXiv Detail & Related papers (2023-01-11T02:18:26Z) - When are Local Queries Useful for Robust Learning? [25.832511407411637]
We study learning models where the learner is given more power through the use of local queries.
We give the first distribution-free algorithms that perform robust empirical risk minimization.
We finish by giving robust learning algorithms for halfspaces on $\{0,1\}^n$ and then obtaining robustness guarantees for halfspaces in $\mathbb{R}^n$ against precision-bounded adversaries.
arXiv Detail & Related papers (2022-10-12T11:04:22Z) - Tractable Optimality in Episodic Latent MABs [75.17357040707347]
We consider a multi-armed bandit problem with $M$ latent contexts, where an agent interacts with the environment for an episode of $H$ time steps.
Depending on the length of the episode, the learner may not be able to estimate accurately the latent context.
We design a procedure that provably learns a near-optimal policy with $O(\texttt{poly}(A) + \texttt{poly}(M,H)^{\min(M,H)})$ interactions.
arXiv Detail & Related papers (2022-10-05T22:53:46Z) - Time Discretization-Invariant Safe Action Repetition for Policy Gradient
Methods [43.49494338665518]
We propose a $\delta$-invariant algorithm for policy gradient (PG) methods.
We show that our method is not only $\delta$-invariant but also robust to stochasticity, outperforming previous $\delta$-invariant approaches.
arXiv Detail & Related papers (2021-11-06T19:17:24Z) - Iterative Feature Matching: Toward Provable Domain Generalization with
Logarithmic Environments [55.24895403089543]
Domain generalization aims at performing well on unseen test environments with data from a limited number of training environments.
We present a new algorithm based on performing iterative feature matching that is guaranteed with high probability to yield a predictor that generalizes after seeing only $O(\log d_s)$ environments.
arXiv Detail & Related papers (2021-06-18T04:39:19Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal
Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z) - Toward Adversarial Robustness via Semi-supervised Robust Training [93.36310070269643]
Adversarial examples have been shown to be a severe threat to deep neural networks (DNNs).
We propose a novel defense method, robust training (RT), by jointly minimizing two separate risks ($R_{stand}$ and $R_{rob}$).
arXiv Detail & Related papers (2020-03-16T02:14:08Z)