Related papers: Robust Q-Learning under Corrupted Rewards

Robust Q-Learning under Corrupted Rewards

URL: http://arxiv.org/abs/2409.03237v1
Date: Thu, 5 Sep 2024 04:37:02 GMT
Title: Robust Q-Learning under Corrupted Rewards
Authors: Sreejeet Maity, Aritra Mitra,
Abstract summary: We investigate the robustness of the celebrated Q-learning algorithm to a strong-contamination attack model. We develop a novel robust synchronous Q-learning algorithm that uses historical reward data to construct robust empirical Bellman operators. Our results continue to hold even when the true reward distributions have infinite support, provided they admit bounded second moments.
Score: 2.07180164747172
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, there has been a surge of interest in analyzing the non-asymptotic behavior of model-free reinforcement learning algorithms. However, the performance of such algorithms in non-ideal environments, such as in the presence of corrupted rewards, is poorly understood. Motivated by this gap, we investigate the robustness of the celebrated Q-learning algorithm to a strong-contamination attack model, where an adversary can arbitrarily perturb a small fraction of the observed rewards. We start by proving that such an attack can cause the vanilla Q-learning algorithm to incur arbitrarily large errors. We then develop a novel robust synchronous Q-learning algorithm that uses historical reward data to construct robust empirical Bellman operators at each time step. Finally, we prove a finite-time convergence rate for our algorithm that matches known state-of-the-art bounds (in the absence of attacks) up to a small inevitable $O(\varepsilon)$ error term that scales with the adversarial corruption fraction $\varepsilon$. Notably, our results continue to hold even when the true reward distributions have infinite support, provided they admit bounded second moments.

Related papers

Adversarially-Robust TD Learning with Markovian Data: Finite-Time Rates and Fundamental Limits [2.07180164747172]
Motivated by harsh, real-world environments, we revisit the policy evaluation problem from the perspective of adversarial robustness. We develop a novel algorithm called Robust-TD and prove that its finite-time guarantees match that of vanilla TD with linear function approximation up to a small $O(epsilon)$ term. To our knowledge, these results are the first of their kind in the context of adversarial approximation schemes driven by Markov noise.
arXiv Detail & Related papers (2025-02-07T05:05:42Z)
Agnostic Multi-Robust Learning Using ERM [19.313739782029185]
A fundamental problem in robust learning is asymmetry: a learner needs to correctly classify every one of exponentially-many perturbations that an adversary might make to a test-time natural example. In contrast, the attacker only needs to find one successful perturbation. We introduce a novel multi-group setting and introduce a novel multi-robust learning problem.
arXiv Detail & Related papers (2023-03-15T21:30:14Z)
Stabilizing Off-Policy Deep Reinforcement Learning from Pixels [9.998078491879145]
Off-policy reinforcement learning from pixel observations is notoriously unstable. We show that these instabilities arise from performing temporal-difference learning with a convolutional encoder and low-magnitude rewards. We propose A-LIX, a method providing adaptive regularization to the encoder's gradients that explicitly prevents the occurrence of catastrophic self-overfitting.
arXiv Detail & Related papers (2022-07-03T08:52:40Z)
A Robust Phased Elimination Algorithm for Corruption-Tolerant Gaussian Process Bandits [118.22458816174144]
We propose a novel robust elimination-type algorithm that runs in epochs, combines exploration with infrequent switching to select a small subset of actions, and plays each action for multiple time instants. Our algorithm, GP Robust Phased Elimination (RGP-PE), successfully balances robustness to corruptions with exploration and exploitation. We perform the first empirical study of robustness in the corrupted GP bandit setting, and show that our algorithm is robust against a variety of adversarial attacks.
arXiv Detail & Related papers (2022-02-03T21:19:36Z)
Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off. We introduce emphanti-concentrated confidence bounds for efficiently approximating the elliptical bonus. We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
arXiv Detail & Related papers (2021-10-21T15:25:15Z)
Outlier-Robust Learning of Ising Models Under Dobrushin's Condition [57.89518300699042]
We study the problem of learning Ising models satisfying Dobrushin's condition in the outlier-robust setting where a constant fraction of the samples are adversarially corrupted. Our main result is to provide the first computationally efficient robust learning algorithm for this problem with near-optimal error guarantees.
arXiv Detail & Related papers (2021-02-03T18:00:57Z)
Byzantine-Resilient Non-Convex Stochastic Gradient Descent [61.6382287971982]
adversary-resilient distributed optimization, in which. machines can independently compute gradients, and cooperate. Our algorithm is based on a new concentration technique, and its sample complexity. It is very practical: it improves upon the performance of all prior methods when no. setting machines are present.
arXiv Detail & Related papers (2020-12-28T17:19:32Z)
Sample-Optimal and Efficient Learning of Tree Ising models [24.201827888085944]
We show that $n$-variable tree-structured Ising models can be learned computationally-efficiently to within total variation distance $epsilon$ from an optimal $O(n ln n/epsilon2)$ samples. Our guarantees do not follow from known results for the Chow-Liu algorithm.
arXiv Detail & Related papers (2020-10-28T10:17:48Z)
Finite-Time Analysis for Double Q-learning [50.50058000948908]
We provide the first non-asymptotic, finite-time analysis for double Q-learning. We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $epsilon$-accurate neighborhood of the global optimum.
arXiv Detail & Related papers (2020-09-29T18:48:21Z)
Reparameterized Variational Divergence Minimization for Stable Imitation [57.06909373038396]
We study the extent to which variations in the choice of probabilistic divergence may yield more performant ILO algorithms. We contribute a re parameterization trick for adversarial imitation learning to alleviate the challenges of the promising $f$-divergence minimization framework. Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks.
arXiv Detail & Related papers (2020-06-18T19:04:09Z)
Q* Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison [17.692408242465763]
We prove performance guarantees of two algorithms for approxing $Qstar$ in batch reinforcement learning. One of the algorithms uses a novel and explicit importance-weighting correction to overcome the infamous "double sampling" difficulty in Bellman error estimation.
arXiv Detail & Related papers (2020-03-09T05:12:39Z)
Robust Stochastic Bandit Algorithms under Probabilistic Unbounded Adversarial Attack [41.060507338755784]
This paper investigates the attack model where an adversary attacks with a certain probability at each round, and its attack value can be arbitrary and unbounded if it attacks. We propose a novel sample median-based and exploration-aided UCB algorithm (called med-E-UCB) and a median-based $epsilon$-greedy algorithm (called med-$epsilon$-greedy) Both algorithms are provably robust to the aforementioned attack model. More specifically we show that both algorithms achieve $mathcalO(log T)$ pseudo-regret (i.e
arXiv Detail & Related papers (2020-02-17T19:21:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.