MemLens: Uncovering Memorization in LLMs with Activation Trajectories
- URL: http://arxiv.org/abs/2509.20909v1
- Date: Thu, 25 Sep 2025 08:55:18 GMT
- Title: MemLens: Uncovering Memorization in LLMs with Activation Trajectories
- Authors: Zirui He, Haiyan Zhao, Ali Payani, Mengnan Du
- Abstract summary: We propose MemLens to detect memorization by analyzing the probability trajectories of numeric tokens during generation. Our method reveals that contaminated samples exhibit ``shortcut'' behaviors, locking onto an answer with high confidence. We observe that contaminated and clean samples exhibit distinct and well-separated reasoning trajectories.
- Score: 39.5728313604839
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are commonly evaluated on challenging benchmarks such as AIME and Math500, which are susceptible to contamination and at risk of being memorized. Existing detection methods, which primarily rely on surface-level lexical overlap and perplexity, generalize poorly and degrade significantly on implicitly contaminated data. In this paper, we propose MemLens (An Activation Lens for Memorization Detection) to detect memorization by analyzing the probability trajectories of numeric tokens during generation. Our method reveals that contaminated samples exhibit ``shortcut'' behaviors, locking onto an answer with high confidence in the model's early layers, whereas clean samples show more gradual evidence accumulation across the model's full depth. We observe that contaminated and clean samples follow distinct and well-separated reasoning trajectories. To further validate this, we inject carefully designed samples into the model through LoRA fine-tuning and observe the same trajectory patterns as in naturally contaminated data. These results provide strong evidence that MemLens captures genuine signals of memorization rather than spurious correlations.
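To make the trajectory idea concrete, below is a minimal logit-lens-style sketch: it projects each layer's hidden state through the final LM head and records the probability assigned to a candidate numeric answer token. The paper's exact probe is not described in this listing, so the model choice, prompt, and projection below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a per-layer probability trajectory for a numeric answer
# token. Assumes a GPT-2-style HF model; the actual MemLens probe may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model, not the one used in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

prompt = "Q: What is 6 * 7? A:"
answer_id = tok(" 42", add_special_tokens=False).input_ids[0]  # hypothetical answer token

with torch.no_grad():
    out = model(**tok(prompt, return_tensors="pt"))

# hidden_states holds the embedding output plus one tensor per transformer layer.
trajectory = []
for h in out.hidden_states:
    # Logit lens: apply the final layer norm and LM head to each layer's state.
    logits = model.lm_head(model.transformer.ln_f(h[:, -1, :]))
    trajectory.append(torch.softmax(logits, dim=-1)[0, answer_id].item())

# A memorized ("shortcut") sample would place high probability on the answer
# already in early layers; a clean sample accumulates it only near the top.
print([round(p, 3) for p in trajectory])
```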
Related papers
- Unveiling Over-Memorization in Finetuning LLMs for Reasoning Tasks [10.807620342718309]
Pretrained large language models (LLMs) are finetuned with labeled data to improve instruction following and alignment with human values. We study the learning dynamics of LLM finetuning on reasoning tasks and reveal a previously overlooked over-memorization phenomenon. We propose techniques such as checkpoint merging and memorization-aware reweighting to mitigate this effect (see the merging sketch after this list).
arXiv Detail & Related papers (2025-08-06T06:34:12Z) - BURN: Backdoor Unlearning via Adversarial Boundary Analysis [73.14147934175604]
Backdoor unlearning aims to remove backdoor-related information while preserving the model's original functionality. We propose Backdoor Unlearning via adversaRial bouNdary analysis (BURN), a novel defense framework that integrates false correlation decoupling, progressive data refinement, and model purification.
arXiv Detail & Related papers (2025-07-14T17:13:06Z) - Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models [12.519879298717104]
We propose a stealthy backdoor sample detection method based on Reference-Filtration and Tfidf-Clustering mechanisms. Experiments on two machine translation datasets and one QA dataset demonstrate that RFTC outperforms baselines in backdoor detection and model performance.
arXiv Detail & Related papers (2025-05-29T02:49:29Z) - A Closer Look on Memorization in Tabular Diffusion Model: A Data-Centric Perspective [15.33961902853653]
We quantify memorization for each real sample based on how many generated samples are flagged as replicas (see the counting sketch after this list). Our empirical analysis reveals a heavy-tailed distribution of memorization counts. We propose DynamicCut, a two-stage, model-agnostic mitigation method.
arXiv Detail & Related papers (2025-05-28T13:06:00Z) - Redistribute Ensemble Training for Mitigating Memorization in Diffusion Models [31.92526915009259]
Diffusion models are known for their tremendous ability to generate high-quality samples. Recent methods for memorization mitigation have primarily addressed the issue within the context of the text modality. We propose a novel method for diffusion models from the perspective of the visual modality, which is more generic and fundamental for mitigating memorization.
arXiv Detail & Related papers (2025-02-13T15:56:44Z) - Unlearnable Examples Detection via Iterative Filtering [84.59070204221366]
Deep neural networks are known to be vulnerable to data poisoning attacks.
Detecting poisoned samples within a mixed dataset is both beneficial and challenging.
We propose an Iterative Filtering approach for identifying unlearnable examples (UEs).
arXiv Detail & Related papers (2024-08-15T13:26:13Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework that uses prompt chaining to perturb the original evidence and generate new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful for triggering hallucinations in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when the models are deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z) - Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, Memorization Discrepancy, to explore defenses via model-level information.
By implicitly transferring changes in the data manipulation to changes in the model outputs, Memorization Discrepancy can discover imperceptible poison samples.
We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
arXiv Detail & Related papers (2023-06-06T14:45:24Z)
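The checkpoint-merging technique named in the over-memorization entry above is commonly read as weighted parameter averaging across finetuning checkpoints; the sketch below shows that reading. The averaging scheme is an assumption for illustration, not the paper's documented procedure.

```python
# Hedged sketch: merge finetuning checkpoints by weighted parameter averaging.
# The uniform weighting here is an assumption, not the paper's procedure.
import torch

def merge_checkpoints(state_dicts, weights=None):
    """Average several state dicts with optional per-checkpoint weights."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Usage: merged = merge_checkpoints([torch.load(p) for p in checkpoint_paths])
```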
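Likewise, the per-sample memorization count described in the tabular-diffusion entry can be sketched as counting, for each real row, the generated rows flagged as near-replicas. The L2 distance metric and the threshold are illustrative assumptions; the paper's replica-flagging criterion may differ.

```python
# Hedged sketch: count generated samples that are near-replicas of each real
# sample. L2 distance and the threshold tau are assumptions for illustration.
import numpy as np

def memorization_counts(real, generated, tau=0.05):
    """For each real row, count generated rows within L2 distance tau."""
    d = np.linalg.norm(real[:, None, :] - generated[None, :, :], axis=-1)
    return (d < tau).sum(axis=1)

rng = np.random.default_rng(0)
real = rng.normal(size=(100, 8))
generated = np.vstack([real[:5] + 1e-3, rng.normal(size=(200, 8))])  # 5 planted replicas
counts = memorization_counts(real, generated)
print(counts[:5], counts[5:].max())  # planted rows counted once; the rest zero
```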