Understanding Verbatim Memorization in LLMs Through Circuit Discovery
- URL: http://arxiv.org/abs/2506.21588v1
- Date: Tue, 17 Jun 2025 20:14:56 GMT
- Title: Understanding Verbatim Memorization in LLMs Through Circuit Discovery
- Authors: Ilya Lasy, Peter Knees, Stefan Woltran
- Abstract summary: Underlying mechanisms of memorization in LLMs remain poorly understood. We use transformer circuits -- the minimal computational subgraphs that perform specific functions within the model. We find that circuits that initiate memorization can also maintain it once started, while circuits that only maintain memorization cannot trigger its initiation.
- Score: 11.007171636579868
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The underlying mechanisms of memorization in LLMs -- the verbatim reproduction of training data -- remain poorly understood. Which exact part of the network decides to retrieve a token that we would consider the start of a memorized sequence? How exactly does the model's behaviour differ when producing memorized versus non-memorized text? In this work we approach these questions from a mechanistic interpretability standpoint by utilizing transformer circuits -- the minimal computational subgraphs that perform specific functions within the model. Through carefully constructed contrastive datasets, we identify the points where model generation diverges from memorized content and isolate the specific circuits responsible for two distinct aspects of memorization. We find that circuits that initiate memorization can also maintain it once started, while circuits that only maintain memorization cannot trigger its initiation. Intriguingly, memorization prevention mechanisms transfer robustly across different text domains, while memorization induction appears more context-dependent.
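The paper's pipeline is not reproduced here, but the basic operation behind this kind of circuit discovery, activation patching between a memorized prompt and a contrastive one, can be sketched. Everything below (model, layer choice, prompts) is an illustrative assumption, not the authors' setup:

```python
# Minimal activation-patching sketch (illustrative; not the authors' code).
# Idea: run the model on a "clean" memorized prompt, cache one layer's output,
# then replay that activation while the model reads a contrastive prompt and
# observe how the next-token logits shift toward the memorized continuation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # stand-in model; the paper's subject models may differ
LAYER = 6        # hypothetical layer to patch

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

clean_ids = tok("The quick brown fox jumps over the lazy", return_tensors="pt").input_ids
contrast_ids = tok("The quick brown cat jumps over the lazy", return_tensors="pt").input_ids
assert clean_ids.shape == contrast_ids.shape  # positions must align for patching

with torch.no_grad():
    clean_run = model(clean_ids, output_hidden_states=True)
# hidden_states[0] is the embedding output, so block L's output is index L + 1.
cached = clean_run.hidden_states[LAYER + 1]

def patch(module, inputs, output):
    # Swap this block's output for the cached clean activation (all positions).
    if isinstance(output, tuple):
        return (cached,) + output[1:]
    return cached

handle = model.transformer.h[LAYER].register_forward_hook(patch)
with torch.no_grad():
    patched_logits = model(contrast_ids).logits[0, -1]
handle.remove()

with torch.no_grad():
    base_logits = model(contrast_ids).logits[0, -1]

# Tokens whose logits the patch boosted most; " dog" should rise if the
# patched layer carries the memorized continuation of the pangram.
shift = patched_logits - base_logits
print([tok.decode(i) for i in shift.topk(5).indices])
```

Full circuit discovery iterates this at the level of individual attention heads and MLPs across positions, keeping only components whose patching moves the output; the coarse per-layer version above only shows the mechanics.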
Related papers
- Captured by Captions: On Memorization and its Mitigation in CLIP Models [23.005901198213966]
We propose a formal definition of memorization in CLIP and use it to quantify memorization in CLIP models. Our results indicate that CLIP's memorization behavior falls between the supervised and self-supervised paradigms, with "mis-captioned" samples exhibiting the highest levels of memorization. We find that the text encoder contributes more to memorization than the image encoder, suggesting that mitigation strategies should focus on the text domain.
arXiv Detail & Related papers (2025-02-11T00:11:13Z)
- Demystifying Verbatim Memorization in Large Language Models [67.49068128909349]
Large Language Models (LLMs) frequently memorize long sequences verbatim, often with serious legal and privacy implications.
We develop a framework to study verbatim memorization in a controlled setting by continuing pre-training from Pythia checkpoints with injected sequences.
We find that (1) non-trivial amounts of repetition are necessary for verbatim memorization to happen; (2) later (and presumably better) checkpoints are more likely to memorize verbatim sequences, even for out-of-distribution sequences.
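For context, the verbatim-memorization test such a framework relies on is easy to state: prompt the model with a prefix of the injected sequence and check whether greedy decoding reproduces the remainder exactly. A minimal sketch, assuming a small Pythia checkpoint and an arbitrary prefix length:

```python
# Hedged sketch of a verbatim-memorization check (details are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-160m"  # an example Pythia checkpoint; size is our choice

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def is_verbatim_memorized(text: str, prefix_len: int = 32) -> bool:
    """Prompt with the first prefix_len tokens, greedy-decode the rest, and
    count the sequence as memorized only on an exact token-level match."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    prompt, target = ids[:prefix_len], ids[prefix_len:]
    with torch.no_grad():
        out = model.generate(
            prompt.unsqueeze(0),
            max_new_tokens=len(target),
            do_sample=False,                 # greedy decoding
            pad_token_id=tok.eos_token_id,   # silence the missing-pad warning
        )
    return torch.equal(out[0, prefix_len:], target)
```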
arXiv Detail & Related papers (2024-07-25T07:10:31Z)
- A Multi-Perspective Analysis of Memorization in Large Language Models [10.276594755936529]
Large Language Models (LLMs) show unprecedented performance in various fields.
LLMs can generate the same content used to train them.
This research comprehensively discusses memorization from various perspectives.
arXiv Detail & Related papers (2024-05-19T15:00:50Z)
- PARMESAN: Parameter-Free Memory Search and Transduction for Dense Prediction Tasks [5.5127111704068374]
This work addresses flexibility in deep learning by means of transductive reasoning. We propose PARMESAN, a scalable method which leverages a memory module for solving dense prediction tasks. Our method is compatible with commonly used architectures and canonically transfers to 1D, 2D, and 3D grid-based data.
arXiv Detail & Related papers (2024-03-18T12:55:40Z)
- Exploring Memorization in Fine-tuned Language Models [53.52403444655213]
We conduct the first comprehensive analysis to explore language models' memorization during fine-tuning across tasks.
Our studies with open-source and our own fine-tuned LMs across various tasks indicate that memorization differs markedly across fine-tuning tasks.
We provide an intuitive explanation of this task disparity via sparse coding theory and unveil a strong correlation between memorization and attention score distribution.
arXiv Detail & Related papers (2023-10-10T15:41:26Z)
- Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy [91.98116450958331]
We argue that verbatim memorization definitions are too restrictive and fail to capture more subtle forms of memorization.
Specifically, we design and implement an efficient defense that perfectly prevents all verbatim memorization.
We conclude by discussing potential alternative definitions and why defining memorization is a difficult yet crucial open question for neural language models.
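The summary leaves the defense abstract, but a decoder in this spirit can be sketched: at each step, veto any token that would complete an n-gram present in the training data. An efficient membership structure would be used in practice; a plain Python set stands in here, and all names are ours:

```python
# Hedged sketch of an n-gram-blocking decoder (our construction).
import torch

def blocklist_greedy_decode(model, tok, prompt, training_ngrams, n=5, max_new_tokens=50):
    """Greedy decoding that never emits an n-gram present in training data,
    which by construction prevents verbatim reproduction of n or more tokens."""
    ids = tok(prompt, return_tensors="pt").input_ids[0].tolist()
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(torch.tensor([ids])).logits[0, -1].clone()
        while True:
            cand = int(logits.argmax())
            if tuple(ids[-(n - 1):] + [cand]) in training_ngrams:
                logits[cand] = float("-inf")  # veto: would complete a training n-gram
            else:
                break
        ids.append(cand)
    return tok.decode(ids)
```

By construction no span of n or more training tokens can ever be emitted, which is exactly why the title calls the resulting sense of privacy false: the model can still produce near-verbatim variants that the n-gram check never matches.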
arXiv Detail & Related papers (2022-10-31T17:57:55Z)
- Measures of Information Reflect Memorization Patterns [53.71420125627608]
We show that the diversity in the activation patterns of different neurons is reflective of model generalization and memorization.
Importantly, we discover that information organization points to the two forms of memorization, even for neural activations computed on unlabelled in-distribution examples.
arXiv Detail & Related papers (2022-10-17T20:15:24Z)
- Understanding Transformer Memorization Recall Through Idioms [42.28269674547148]
We offer the first methodological framework for probing and characterizing recall of memorized sequences in language models.
We analyze the internal prediction construction process by interpreting the model's hidden representations as a gradual refinement of the output probability distribution.
Our work takes a first step towards understanding memory recall, and provides a methodological basis for future studies of transformer memorization.
arXiv Detail & Related papers (2022-10-07T14:45:31Z)
- Counterfactual Memorization in Neural Language Models [91.8747020391287]
Modern neural language models that are widely used in various NLP tasks risk memorizing sensitive information from their training data.
An open question in previous studies of language model memorization is how to filter out "common" memorization.
We formulate a notion of counterfactual memorization which characterizes how a model's predictions change if a particular document is omitted during training.
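That characterization has a compact form; a hedged reconstruction (notation ours, not copied from the paper):

```latex
% Counterfactual memorization of a training example x (notation ours).
% f_S is a model trained on data subset S, and M(f, x) measures f's
% performance on x (e.g. per-token accuracy or log-likelihood).
\[
  \mathrm{mem}(x)
    = \mathbb{E}_{S \ni x}\!\left[ M(f_S, x) \right]
    - \mathbb{E}_{S \not\ni x}\!\left[ M(f_S, x) \right]
\]
% Large mem(x): the model predicts x well only when x was in training.
% Small mem(x): x is predictable from the rest of the data -- the "common"
% memorization the authors want to filter out.
```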
arXiv Detail & Related papers (2021-12-24T04:20:57Z)
- Encoding-based Memory Modules for Recurrent Neural Networks [79.42778415729475]
We study the memorization subtask from the point of view of the design and training of recurrent neural networks.
We propose a new model, the Linear Memory Network, which features an encoding-based memorization component built with a linear autoencoder for sequences.
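A toy sketch of what such an encoding-based memory can look like, assuming nothing about the paper's exact parameterization: a linear recurrence writes the sequence into a fixed-size state, and a paired linear map unrolls it back out.

```python
# Toy linear sequence autoencoder (our construction; dimensions are examples).
import torch
import torch.nn as nn

class LinearSeqAutoencoder(nn.Module):
    """Encode x_1..x_T into one state via h_t = A h_{t-1} + B x_t, then
    decode by repeatedly emitting C h and rewinding the state with D."""
    def __init__(self, input_dim: int, state_dim: int):
        super().__init__()
        self.A = nn.Linear(state_dim, state_dim, bias=False)  # state transition
        self.B = nn.Linear(input_dim, state_dim, bias=False)  # input write-in
        self.C = nn.Linear(state_dim, input_dim, bias=False)  # read out last input
        self.D = nn.Linear(state_dim, state_dim, bias=False)  # rewind one step
        self.state_dim = state_dim

    def encode(self, xs: torch.Tensor) -> torch.Tensor:  # xs: (T, input_dim)
        h = xs.new_zeros(self.state_dim)
        for x in xs:
            h = self.A(h) + self.B(x)
        return h

    def decode(self, h: torch.Tensor, steps: int) -> torch.Tensor:
        outs = []
        for _ in range(steps):
            outs.append(self.C(h))   # reconstruct x_t
            h = self.D(h)            # step the state back to h_{t-1}
        return torch.stack(outs[::-1])  # restore original order

# Training would minimize reconstruction error, e.g.
#   loss = ((m.decode(m.encode(xs), len(xs)) - xs) ** 2).mean()
```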
arXiv Detail & Related papers (2020-01-31T11:14:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the list (including all of the information it contains) and is not responsible for any consequences of its use.