GRPO Privacy Is at Risk: A Membership Inference Attack Against Reinforcement Learning With Verifiable Rewards
- URL: http://arxiv.org/abs/2511.14045v1
- Date: Tue, 18 Nov 2025 01:51:34 GMT
- Title: GRPO Privacy Is at Risk: A Membership Inference Attack Against Reinforcement Learning With Verifiable Rewards
- Authors: Yule Liu, Heyi Zhang, Jinyi Zheng, Zhen Sun, Zifan Peng, Tianshuo Cong, Yilong Yang, Xinlei He, Zhuo Ma,
- Abstract summary: Divergence-in-Behavior Attack (DIBA) is the first membership inference framework specifically designed for Reinforcement Learning with Verifiable Rewards. We show that DIBA significantly outperforms existing baselines, achieving around 0.8 AUC and an order-of-magnitude higher TPR@0.1%FPR. This is the first work to systematically analyze privacy vulnerabilities in RLVR, revealing that training data exposure can be reliably inferred through behavioral traces.
- Score: 13.369116707284121
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Membership inference attacks (MIAs) on large language models (LLMs) pose significant privacy risks across various stages of model training. Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have brought a profound paradigm shift in LLM training, particularly for complex reasoning tasks. However, the on-policy nature of RLVR introduces a unique privacy leakage pattern: since training relies on self-generated responses without fixed ground-truth outputs, membership inference must now determine whether a given prompt (independent of any specific response) is used during fine-tuning. This creates a threat where leakage arises not from answer memorization. To audit this novel privacy risk, we propose Divergence-in-Behavior Attack (DIBA), the first membership inference framework specifically designed for RLVR. DIBA shifts the focus from memorization to behavioral change, leveraging measurable shifts in model behavior across two axes: advantage-side improvement (e.g., correctness gain) and logit-side divergence (e.g., policy drift). Through comprehensive evaluations, we demonstrate that DIBA significantly outperforms existing baselines, achieving around 0.8 AUC and an order-of-magnitude higher TPR@0.1%FPR. We validate DIBA's superiority across multiple settings--including in-distribution, cross-dataset, cross-algorithm, black-box scenarios, and extensions to vision-language models. Furthermore, our attack remains robust under moderate defensive measures. To the best of our knowledge, this is the first work to systematically analyze privacy vulnerabilities in RLVR, revealing that even in the absence of explicit supervision, training data exposure can be reliably inferred through behavioral traces.
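The abstract describes two behavioral signals an auditor can measure per prompt, an advantage-side correctness gain and a logit-side policy drift, and reports the usual MIA metrics, AUC and TPR at a fixed low FPR. The Python below is an added illustration of that audit loop, not code from the paper: it builds a constructed set of per-prompt traces (synthetic pass rates and KL values for a hypothetical reference model and tuned model), scores each prompt on each axis and on a z-score fusion of both, and computes AUC and TPR@FPR with a threshold calibrated on known non-members. The signal definitions, the fusion rule and the synthetic distributions are assumptions of this sketch; the paper's actual features and decision rule are not reproduced here.

```python
"""Toy behavioral-divergence membership audit for an RLVR-tuned model.

Illustration only: synthetic signals stand in for what an auditor would
measure by sampling k responses per prompt from a reference model and the
tuned model and running the verifier. Not the paper's code or feature set.
"""
import math
import random
from dataclasses import dataclass


@dataclass
class PromptTrace:
    prompt_id: str
    ref_correct: float    # verifier pass rate of reference-model samples (hypothetical signal)
    tuned_correct: float  # verifier pass rate of tuned-model samples on the same prompt
    kl_divergence: float  # mean per-token KL(tuned || ref) on tuned-model samples
    member: bool          # ground truth, known only in the calibration set


def make_synthetic_traces(n_members=200, n_nonmembers=200, seed=7):
    """Constructed data: members show larger correctness gain and policy drift."""
    rng = random.Random(seed)
    traces = []
    for label, n, gain_mu, kl_mu in ((True, n_members, 0.30, 0.35),
                                     (False, n_nonmembers, 0.10, 0.20)):
        for i in range(n):
            ref = rng.betavariate(2, 5)
            tuned = min(1.0, max(0.0, ref + rng.gauss(gain_mu, 0.15)))
            kl = abs(rng.gauss(kl_mu, 0.12))
            traces.append(PromptTrace(f"{'m' if label else 'n'}{i}", ref, tuned, kl, label))
    return traces


def zscore(values):
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return [(v - mean) / sd for v in values]


def advantage_scores(traces):
    """Advantage-side signal: correctness gain from reference to tuned model."""
    return [t.tuned_correct - t.ref_correct for t in traces]


def divergence_scores(traces):
    """Logit-side signal: how far the tuned policy drifted on this prompt."""
    return [t.kl_divergence for t in traces]


def combined_scores(traces):
    """Fuse both axes by summing z-scores (an assumed fusion rule, not the paper's)."""
    return [a + d for a, d in zip(zscore(advantage_scores(traces)),
                                  zscore(divergence_scores(traces)))]


def roc_auc(scores, labels):
    """Mann-Whitney AUC: P(member score > non-member score), ties count 1/2."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(scores, labels, max_fpr):
    """TPR at the strictest threshold whose FPR on the audit set is <= max_fpr.

    Caveat: FPR resolution is 1/len(neg). Reporting TPR@0.1%FPR needs at least
    1000 non-member prompts; with fewer, this measures TPR at FPR = 0.
    """
    pos = [s for s, l in zip(scores, labels) if l]
    neg = sorted(s for s, l in zip(scores, labels) if not l)
    allowed = int(math.floor(max_fpr * len(neg)))  # non-members we may flag
    if allowed >= len(neg):
        return 1.0
    threshold = neg[len(neg) - 1 - allowed]        # flag only scores strictly above
    return sum(1 for p in pos if p > threshold) / len(pos)


def run_audit(seed=7, max_fpr=0.01):
    """Score the constructed audit set on each axis and on the fused score."""
    traces = make_synthetic_traces(seed=seed)
    labels = [t.member for t in traces]
    result = {}
    for name, scorer in (("advantage", advantage_scores),
                         ("divergence", divergence_scores),
                         ("combined", combined_scores)):
        scores = scorer(traces)
        result[name] = {"auc": roc_auc(scores, labels),
                        "tpr_at_fpr": tpr_at_fpr(scores, labels, max_fpr)}
    return result


if __name__ == "__main__":
    for name, metrics in run_audit().items():
        print(f"{name:10s} AUC={metrics['auc']:.3f} TPR@1%FPR={metrics['tpr_at_fpr']:.3f}")
```

Two practitioner notes: the threshold in `tpr_at_fpr` must be calibrated on prompts known not to be in the training set (or on a shadow model), never on the audit set's own members, and the 0.1% FPR operating point the abstract reports only makes sense with at least 1000 non-member prompts, since FPR moves in steps of 1/len(neg).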
Related papers
- Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation [56.92367609590823]
Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs). We argue that Long CoT is inherently ill-suited for the sequential recommendation domain. We propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation.
arXiv Detail & Related papers (2026-01-31T10:02:43Z)
- Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models [33.214586668992965]
Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning. We propose RECAP, a replay strategy with dynamic objective reweighting for general knowledge. Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy tuning.
arXiv Detail & Related papers (2025-10-24T19:08:48Z)
- The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward [57.56453588632619]
A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance. This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. We argue that standard RLVR objectives lack a crucial mechanism for knowledge retention.
arXiv Detail & Related papers (2025-09-09T06:34:32Z)
- Anomalous Decision Discovery using Inverse Reinforcement Learning [3.3675535571071746]
Anomaly detection plays a critical role in Autonomous Vehicles (AVs) by identifying unusual behaviors through perception systems. Current approaches, which often rely on predefined thresholds or supervised learning paradigms, exhibit reduced efficacy when confronted with unseen scenarios. We present Trajectory-Reward Guided Adaptive Pre-training (TRAP), a novel IRL framework for anomaly detection.
arXiv Detail & Related papers (2025-07-06T17:01:02Z)
- When Better Features Mean Greater Risks: The Performance-Privacy Trade-Off in Contrastive Learning [9.660010886245155]
This paper systematically investigates the privacy threats posed by membership inference attacks (MIAs) targeting encoder models. We propose a novel membership inference attack method based on the p-norm of feature vectors, termed the Embedding Lp-Norm Likelihood Attack (LpLA).
arXiv Detail & Related papers (2025-06-06T05:03:29Z)
- Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards [67.86091419220816]
Large Language Models (LLMs) show great promise in complex reasoning. A prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this.
arXiv Detail & Related papers (2025-05-19T17:59:31Z)
- Lie Detector: Unified Backdoor Detection via Cross-Examination Framework [68.45399098884364]
We propose a unified backdoor detection framework in the semi-honest setting. Our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines. Notably, it is the first to effectively detect backdoors in multimodal large language models.
arXiv Detail & Related papers (2025-03-21T06:12:06Z)
- DeMem: Privacy-Enhanced Robust Adversarial Learning via De-Memorization [10.45538538066321]
Adversarial robustness is essential for ensuring the trustworthiness of machine learning models in real-world applications. Previous studies have shown that enhancing adversarial robustness through adversarial training increases vulnerability to privacy attacks. We propose DeMem, which selectively targets high-risk samples, achieving a better balance between privacy protection and model robustness.
arXiv Detail & Related papers (2024-12-08T00:22:58Z)
- How Spurious Features Are Memorized: Precise Analysis for Random and NTK Features [19.261178173399784]
We consider spurious features that are uncorrelated with the learning task. We provide a precise characterization of how they are memorized via two separate terms. We prove that the memorization of spurious features weakens as the generalization capability increases.
arXiv Detail & Related papers (2023-05-20T05:27:41Z)
- Policy Smoothing for Provably Robust Reinforcement Learning [109.90239627115336]
We study the provable robustness of reinforcement learning against norm-bounded adversarial perturbations of the inputs. We generate certificates that guarantee that the total reward obtained by the smoothed policy will not fall below a certain threshold under a norm-bounded adversarial perturbation of the input.
arXiv Detail & Related papers (2021-06-21T21:42:08Z)
- Robust Pre-Training by Adversarial Contrastive Learning [120.33706897927391]
Recent work has shown that, when integrated with adversarial training, self-supervised pre-training can lead to state-of-the-art robustness. We improve robustness-aware self-supervised pre-training by learning representations consistent under both data augmentations and adversarial perturbations.
arXiv Detail & Related papers (2020-10-26T04:44:43Z)