RePCS: Diagnosing Data Memorization in LLM-Powered Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2506.15513v1
- Date: Wed, 18 Jun 2025 14:48:19 GMT
- Title: RePCS: Diagnosing Data Memorization in LLM-Powered Retrieval-Augmented Generation
- Authors: Le Vu Anh, Nguyen Viet Anh, Mehmet Dik, Luong Van Nghia
- Abstract summary: Models may still rely on memorized training data, bypass the retrieved evidence, and produce contaminated outputs. We introduce Retrieval-Path Contamination Scoring (RePCS), a diagnostic method that detects such behavior without requiring model access or retraining.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-augmented generation (RAG) has become a common strategy for updating large language model (LLM) responses with current, external information. However, models may still rely on memorized training data, bypass the retrieved evidence, and produce contaminated outputs. We introduce Retrieval-Path Contamination Scoring (RePCS), a diagnostic method that detects such behavior without requiring model access or retraining. RePCS compares two inference paths: (i) a parametric path using only the query, and (ii) a retrieval-augmented path using both the query and retrieved context by computing the Kullback-Leibler (KL) divergence between their output distributions. A low divergence suggests that the retrieved context had minimal impact, indicating potential memorization. This procedure is model-agnostic, requires no gradient or internal state access, and adds only a single additional forward pass. We further derive PAC-style guarantees that link the KL threshold to user-defined false positive and false negative rates. On the Prompt-WNQA benchmark, RePCS achieves a ROC-AUC of 0.918. This result outperforms the strongest prior method by 6.5 percentage points while keeping latency overhead below 4.7% on an NVIDIA T4 GPU. RePCS offers a lightweight, black-box safeguard to verify whether a RAG system meaningfully leverages retrieval, making it especially valuable in safety-critical applications.
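The scoring step described above can be sketched in a few lines. The function names, the epsilon smoothing, and the toy single-step distributions below are illustrative assumptions, not the paper's implementation:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) between two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def repcs_score(retrieval_probs, parametric_probs):
    """Divergence between the retrieval-augmented path and the parametric path.
    A low score means the retrieved context barely shifted the output."""
    return kl_divergence(retrieval_probs, parametric_probs)

def flag_memorization(score, tau):
    """Flag potential memorization when the divergence falls below threshold tau."""
    return score < tau
```

For example, identical next-token distributions on both paths give a score near zero and are flagged, while a context that reshapes the distribution yields a large divergence and passes the check.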
Related papers
- PAIRS: Parametric-Verified Adaptive Information Retrieval and Selection for Efficient RAG [14.631028226704883]
We introduce Parametric-verified Adaptive Information Retrieval and Selection (PAIRS). PAIRS integrates parametric and retrieved knowledge to adaptively determine whether to retrieve and how to select external information. We show that PAIRS reduces retrieval costs by around 25% (triggering retrieval for only 75% of queries) while still improving accuracy, achieving +1.1% EM and +1.0% F1 over prior baselines.
arXiv Detail & Related papers (2025-08-06T03:33:01Z) - DeRAG: Black-box Adversarial Attacks on Multiple Retrieval-Augmented Generation Applications via Prompt Injection [0.9499594220629591]
Adversarial prompt attacks can significantly degrade the reliability of Retrieval-Augmented Generation (RAG) systems. We present a novel method that applies Differential Evolution (DE) to optimize adversarial prompt suffixes for RAG-based question answering.
arXiv Detail & Related papers (2025-07-20T16:48:20Z) - Maximally-Informative Retrieval for State Space Model Generation [59.954191072042526]
We introduce Retrieval In-Context Optimization (RICO) to minimize model uncertainty for a particular query at test time. Unlike traditional retrieval-augmented generation (RAG), which relies on external retrievers for document retrieval, our approach leverages direct feedback from the model. We show that standard top-$k$ retrieval with model gradients can approximate our optimization procedure, and provide connections to the leave-one-out loss.
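The core idea of selecting context by how much it reduces the model's uncertainty can be sketched as follows. The entropy-gain proxy and the `model_probs_fn` interface are assumptions for illustration, not RICO's actual objective:

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution (nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_context(query, docs, model_probs_fn, k=2):
    """Rank documents by how much appending them reduces the model's output
    entropy for the query, then keep the top-k. model_probs_fn maps a prompt
    string to an output distribution (a hypothetical interface)."""
    base = entropy(model_probs_fn(query))
    scored = [(base - entropy(model_probs_fn(query + " " + doc)), doc) for doc in docs]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [doc for _, doc in scored[:k]]
```

With a stub model that becomes more confident when a relevant document is appended, the relevant document is ranked first.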
arXiv Detail & Related papers (2025-06-13T18:08:54Z) - SkipVAR: Accelerating Visual Autoregressive Modeling via Adaptive Frequency-Aware Skipping [30.85025293160079]
High-frequency components, or later steps, in the generation process contribute disproportionately to inference latency. We identify two primary sources of inefficiency: step redundancy and unconditional branch redundancy. We propose an automatic step-skipping strategy that selectively omits unnecessary generation steps to improve efficiency.
arXiv Detail & Related papers (2025-06-10T15:35:29Z) - GPS: Distilling Compact Memories via Grid-based Patch Sampling for Efficient Online Class-Incremental Learning [20.112448377660854]
We introduce Grid-based Patch Sampling (GPS), a lightweight strategy for distilling informative memory samples without relying on a trainable model. GPS generates informative samples by sampling a subset of pixels from the original image, yielding compact low-resolution representations. GPS can be seamlessly integrated into existing replay frameworks, leading to 3%-4% improvements in average end accuracy under memory-constrained settings.
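A minimal sketch of the grid-based subsampling idea: keeping the top-left pixel of each grid cell is an assumption here for simplicity (the paper samples a subset of pixels per cell), but it conveys how a compact low-resolution representation falls out of the grid:

```python
def grid_patch_sample(image, grid=4):
    """Keep one pixel per grid cell, yielding a compact grid x grid
    low-resolution representation. `image` is a 2D list of pixel values."""
    h, w = len(image), len(image[0])
    ph, pw = h // grid, w // grid  # cell height and width
    return [[image[r * ph][c * pw] for c in range(grid)] for r in range(grid)]
```

An 8x8 image with grid=4 is reduced to a 4x4 representation, a 4x memory saving per sample before any further compression.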
arXiv Detail & Related papers (2025-04-14T16:58:02Z) - Out-of-Distribution Detection through Soft Clustering with Non-Negative Kernel Regression [28.828318027398815]
We propose a novel soft clustering approach for OOD detection based on non-negative kernel regression.
Our approach greatly reduces computational and space complexities (up to 11x improvement in inference time and 87% reduction in storage requirements) and outperforms existing approaches by up to 4 AUROC points on four different benchmarks.
arXiv Detail & Related papers (2024-07-18T03:57:08Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL. We find that REBEL provides a unified approach to language modeling and image generation, with performance stronger than or comparable to PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful for triggering hallucinations in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - Q-DETR: An Efficient Low-Bit Quantized Detection Transformer [50.00784028552792]
Through empirical analysis, we find that the bottleneck of Q-DETR lies in query information distortion.
We formulate our DRD as a bi-level optimization problem, which can be derived by generalizing the information bottleneck (IB) principle to the learning of Q-DETR.
We introduce a new foreground-aware query matching scheme to effectively transfer the teacher information to distillation-desired features to minimize the conditional information entropy.
arXiv Detail & Related papers (2023-04-01T08:05:14Z) - Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization [90.9780151608281]
In-sample learning methods such as IQL improve the policy through quantile regression using only data samples.
We make a key finding that the in-sample learning paradigm arises under the Implicit Value Regularization (IVR) framework.
We propose two practical algorithms, Sparse $Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works.
arXiv Detail & Related papers (2023-03-28T08:30:01Z) - Supervised Advantage Actor-Critic for Recommender Systems [76.7066594130961]
We propose a negative sampling strategy for training the RL component and combine it with supervised sequential learning.
Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case.
We instantiate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets.
arXiv Detail & Related papers (2021-11-05T12:51:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.