Optimal Perturbation Budget Allocation for Data Poisoning in Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2512.08485v2
- Date: Wed, 10 Dec 2025 07:43:34 GMT
- Title: Optimal Perturbation Budget Allocation for Data Poisoning in Offline Reinforcement Learning
- Authors: Junnan Qiu, Yuanjie Zhao, Jie Li
- Abstract summary: Offline Reinforcement Learning (RL) enables policy optimization from static datasets but is inherently vulnerable to data poisoning attacks. Existing attack strategies typically rely on locally uniform perturbations, which treat all samples indiscriminately. This approach is inefficient, as it wastes the perturbation budget on low-impact samples, and lacks stealthiness due to significant statistical deviations.
- Score: 3.548727497699329
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline Reinforcement Learning (RL) enables policy optimization from static datasets but is inherently vulnerable to data poisoning attacks. Existing attack strategies typically rely on locally uniform perturbations, which treat all samples indiscriminately. This approach is inefficient, as it wastes the perturbation budget on low-impact samples, and lacks stealthiness due to significant statistical deviations. In this paper, we propose a novel Global Budget Allocation attack strategy. Leveraging the theoretical insight that a sample's influence on value function convergence is proportional to its Temporal Difference (TD) error, we formulate the attack as a global resource allocation problem. We derive a closed-form solution where perturbation magnitudes are assigned proportional to the TD-error sensitivity under a global L2 constraint. Empirical results on D4RL benchmarks demonstrate that our method significantly outperforms baseline strategies, achieving up to 80% performance degradation with minimal perturbations that evade detection by state-of-the-art statistical and spectral defenses.
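The closed-form allocation described in the abstract (perturbation magnitudes proportional to TD-error sensitivity under a global L2 constraint) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the use of absolute TD error as the sensitivity proxy, and the simple proportional normalization are all assumptions.

```python
import numpy as np

def allocate_budget(td_errors, total_budget):
    """Assign per-sample perturbation magnitudes proportional to
    |TD error|, subject to a global L2 budget: ||eps||_2 <= total_budget.
    The proportional rule has the closed form eps = B * |delta| / ||delta||_2."""
    sensitivity = np.abs(np.asarray(td_errors, dtype=float))
    norm = np.linalg.norm(sensitivity)
    if norm == 0.0:
        return np.zeros_like(sensitivity)
    return total_budget * sensitivity / norm

# High-TD-error samples receive most of the budget;
# low-impact samples are barely perturbed.
delta = np.array([0.1, 2.0, 0.05, 1.0])
eps = allocate_budget(delta, total_budget=1.0)
```

Under this rule the full budget is always spent (the allocation saturates the L2 constraint), and the ranking of perturbation magnitudes matches the ranking of TD-error magnitudes.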
Related papers
- Observationally Informed Adaptive Causal Experimental Design [55.998153710215654]
We propose Active Residual Learning, a new paradigm that leverages the observational model as a foundational prior. This approach shifts the experimental focus from learning target causal quantities from scratch to efficiently estimating the residuals required to correct observational bias. Experiments on synthetic and semi-synthetic benchmarks demonstrate that R-Design significantly outperforms baselines.
arXiv Detail & Related papers (2026-03-04T06:52:37Z) - CS-GBA: A Critical Sample-based Gradient-guided Backdoor Attack for Offline Reinforcement Learning [7.5200963577855875]
Offline Reinforcement Learning (RL) enables policy optimization from static datasets but is inherently vulnerable to backdoor attacks. We propose CS-GBA (Critical Sample-based Gradient-guided Backdoor Attack), a novel framework designed to achieve high stealthiness and destructiveness under a strict budget.
arXiv Detail & Related papers (2026-01-15T13:57:52Z) - The Eminence in Shadow: Exploiting Feature Boundary Ambiguity for Robust Backdoor Attacks [51.468144272905135]
Deep neural networks (DNNs) underpin critical applications yet remain vulnerable to backdoor attacks. We provide a theoretical analysis targeting backdoor attacks, focusing on how sparse decision boundaries enable disproportionate model manipulation. We propose Eminence, an explainable and robust black-box backdoor framework with provable theoretical guarantees and inherent stealth properties.
arXiv Detail & Related papers (2025-12-11T08:09:07Z) - Data-Efficient RLVR via Off-Policy Influence Guidance [84.60336960383867]
This work proposes a theoretically grounded approach using influence functions to estimate the contribution of each data point to the learning objective. We develop Curriculum RL with Off-Policy Influence guidance (CROPI), a multi-stage RL framework that iteratively selects the most influential data for the current policy.
arXiv Detail & Related papers (2025-10-30T13:40:52Z) - Distributionally Robust Optimization with Adversarial Data Contamination [49.89480853499918]
We focus on optimizing Wasserstein-1 DRO objectives for generalized linear models with convex Lipschitz loss functions. Our primary contribution lies in a novel modeling framework that integrates robustness against training data contamination with robustness against distributional shifts. This work establishes the first rigorous guarantees, supported by efficient computation, for learning under the dual challenges of data contamination and distributional shifts.
arXiv Detail & Related papers (2025-07-14T18:34:10Z) - Sampling-aware Adversarial Attacks Against Large Language Models [52.30089653615172]
Existing adversarial attacks typically target harmful responses in single-point greedy generations. We show that, for the goal of eliciting harmful responses, it is beneficial to repeatedly sample model outputs during attack prompt optimization. Integrating sampling into existing attacks boosts success rates by up to 37% and improves efficiency by up to two orders of magnitude.
arXiv Detail & Related papers (2025-07-06T16:13:33Z) - Collapsing Sequence-Level Data-Policy Coverage via Poisoning Attack in Offline Reinforcement Learning [12.068924459730248]
Existing studies aim to improve data-policy coverage to mitigate distributional shifts, but overlook security risks from insufficient coverage. We introduce the sequence-level concentrability coefficient to quantify coverage, and reveal its exponential amplification on the upper bound of estimation errors. We identify rare patterns likely to cause insufficient coverage, and poison them to reduce coverage and exacerbate distributional shifts.
arXiv Detail & Related papers (2025-06-12T07:11:27Z) - Rethinking Adversarial Attacks in Reinforcement Learning from Policy Distribution Perspective [17.812046299904576]
We propose the Distribution-Aware Projected Gradient Descent attack (DAPGD). DAPGD uses distribution similarity as the gradient perturbation input to attack the policy network. Our experimental results demonstrate that DAPGD achieves SOTA results compared to the baselines in three robot navigation tasks.
arXiv Detail & Related papers (2025-01-07T06:22:55Z) - Practical Performative Policy Learning with Strategic Agents [8.361090623217246]
We study the performative policy learning problem, where agents adjust their features in response to a released policy to improve their potential outcomes. We propose a gradient-based policy optimization algorithm with a differentiable classifier as a substitute for the high-dimensional distribution map.
arXiv Detail & Related papers (2024-12-02T10:09:44Z) - Sample Dropout: A Simple yet Effective Variance Reduction Technique in Deep Policy Optimization [18.627233013208834]
We show that the use of importance sampling could introduce high variance in the objective estimate.
We propose a technique called sample dropout to bound the estimation variance by dropping out samples when their ratio deviation is too high.
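The thresholding idea behind sample dropout can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the symmetric deviation threshold around 1 are hypothetical, not the paper's exact formulation.

```python
import numpy as np

def sample_dropout_mask(ratios, threshold=0.5):
    """Keep only samples whose importance-sampling ratio stays within
    [1 - threshold, 1 + threshold]; samples with large ratio deviation
    are dropped to bound the variance of the surrogate objective."""
    ratios = np.asarray(ratios, dtype=float)
    return np.abs(ratios - 1.0) <= threshold

# Ratios far from 1 inflate variance and are excluded.
ratios = np.array([0.9, 1.05, 2.3, 0.4])
mask = sample_dropout_mask(ratios)
```

In a policy-gradient update, the objective would then be averaged only over the samples where `mask` is true.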
arXiv Detail & Related papers (2023-02-05T04:44:35Z) - The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift [127.21287240963859]
We investigate a transfer learning approach with pretraining on the source data and finetuning based on the target data.
For a large class of linear regression instances, transfer learning with $O(N^2)$ source data is as effective as supervised learning with $N$ target data.
arXiv Detail & Related papers (2022-08-03T05:59:49Z) - False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z) - Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class.
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
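The importance-sampling weighted ERM estimator described above can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: the function name is hypothetical, and the propensities here stand in for whatever logging probabilities the adaptive collection process recorded.

```python
import numpy as np

def weighted_erm_loss(losses, propensities):
    """Importance-sampling weighted empirical risk: each adaptively
    collected sample is reweighted by the inverse of the probability
    with which it was logged, so the weighted average is an unbiased
    estimate of the risk under uniform data collection."""
    losses = np.asarray(losses, dtype=float)
    weights = 1.0 / np.asarray(propensities, dtype=float)
    return np.mean(weights * losses)

# Samples logged with low probability are up-weighted.
risk = weighted_erm_loss(losses=[1.0, 2.0], propensities=[0.5, 1.0])
```

Minimizing this weighted objective over a hypothesis class recovers standard ERM when all propensities are equal.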
arXiv Detail & Related papers (2021-06-03T09:50:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.