Memorize or Generalize? Evaluating LLM Code Generation with Code Rewriting
- URL: http://arxiv.org/abs/2503.02296v2
- Date: Tue, 30 Sep 2025 00:17:02 GMT
- Title: Memorize or Generalize? Evaluating LLM Code Generation with Code Rewriting
- Authors: Lizhe Zhang, Wentao Chen, Li Zhong, Letian Peng, Zilong Wang, Jingbo Shang
- Abstract summary: We examine whether large language models (LLMs) are mostly doing memorization (i.e., replicating or reusing large parts of their training data) or generalization. Existing evaluations largely proxy memorization with surface/structural similarity, thereby conflating benign reuse of repeated code with harmful recall and neglecting task correctness under semantic variation. We propose the Memorization Risk Index (MRI), a normalized score that combines two signals: (i) how similar the model's answer for the rewritten task is to the original ground-truth solution, and (ii) how much performance drops from the original task to its rewritten counterpart.
- Score: 54.48306552577881
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have recently demonstrated exceptional code generation capabilities. However, there is a growing debate over whether LLMs are mostly doing memorization (i.e., replicating or reusing large parts of their training data) or generalization (i.e., solving tasks beyond their training data). Existing evaluations largely proxy memorization with surface/structural similarity, thereby conflating benign reuse of repeated code with harmful recall and neglecting task correctness under semantic variation. We define harmful memorization behaviorally as failure at high similarity and introduce code rewriting, a semantic perturbation that rewrites a semantically different answer at a similar difficulty level for a given coding task and then reverse-engineers a novel coding task. We further propose the Memorization Risk Index (MRI), a normalized score that combines two signals: (i) how similar the model's answer for the rewritten task is to the original ground-truth solution, and (ii) how much performance drops from the original task to its rewritten counterpart. MRI is high only when both conditions hold -- when the model outputs similar code but fails the perturbed task -- thereby capturing harmful memorization rather than benign reuse of repeated code. Empirical evaluations on the code generation benchmarks MBPP+ and BigCodeBench reveal that (1) memorization does not increase with larger models and is in many cases alleviated as they scale; (2) supervised fine-tuning (SFT) improves accuracy but introduces memorization; (3) reinforcement learning with proximal policy optimization (PPO) achieves a more balanced trade-off between memorization and generalization.
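The abstract gives only a verbal definition of MRI. As a minimal sketch (the combination rule below is an assumption, not the paper's actual formula), the two signals could be multiplied after normalizing the performance drop:

```python
def mri(similarity: float, acc_original: float, acc_rewritten: float) -> float:
    """Sketch of a Memorization Risk Index (hypothetical combination rule).

    similarity: similarity of the model's answer on the REWRITTEN task to
        the ORIGINAL ground-truth solution, in [0, 1].
    acc_original / acc_rewritten: pass rates on the original task and its
        rewritten counterpart, in [0, 1].
    """
    # Normalized performance drop: fraction of original accuracy lost
    # when the task is semantically rewritten.
    drop = max(0.0, acc_original - acc_rewritten) / max(acc_original, 1e-9)
    # High only when the model BOTH echoes the original solution AND fails
    # the rewritten task -- harmful memorization, not benign reuse.
    return similarity * drop
```

Under this sketch, a model that copies the original solution (similarity 0.9) yet keeps its accuracy on the rewrite scores 0, matching the paper's requirement that both conditions hold.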
Related papers
- Readability-Robust Code Summarization via Meta Curriculum Learning [53.44612630063336]
In the real world, code is often poorly structured or obfuscated, significantly degrading model performance. We propose RoFTCodeSum, a novel fine-tuning method that enhances the robustness of code summarization against poorly readable code.
arXiv Detail & Related papers (2026-01-09T02:38:24Z) - Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation [11.819481846962447]
We investigate how agents built on pretrained large language models can learn target classification functions from labeled examples without parameter updates. Our framework uses episodic memory to store instance-level critiques and distill these into reusable, task-level guidance. Our findings highlight the promise of memory-driven, reflective learning for building more adaptive and interpretable LLM agents.
arXiv Detail & Related papers (2025-10-22T17:58:03Z) - Memorization Sinks: Isolating Memorization during LLM Training [20.682505625638203]
Large language models are susceptible to memorizing repeated sequences, posing privacy and copyright concerns. We propose a new paradigm, MemSinks, that promotes isolation of memorization by design. This is the first proof-of-concept on real data demonstrating that simultaneous generalization and isolation are achievable.
arXiv Detail & Related papers (2025-07-14T05:23:27Z) - Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Reasoning [9.719614935865906]
This paper investigates Large Language Models' (LLMs') reasoning ability over code snippets within large repositories. We differentiate between lexical code recall (verbatim retrieval) and semantic code recall (remembering what the code does). Our evaluation of state-of-the-art LLMs reveals a significant drop in code reasoning accuracy as a code snippet approaches the middle of the input context.
arXiv Detail & Related papers (2025-05-19T16:56:31Z) - Rethinking Repetition Problems of LLMs in Code Generation [36.42947561896802]
We propose an efficient decoding approach called RPG, which stands for Repetition Penalization based on Grammar. RPG first leverages grammar rules to identify repetition problems during code generation, and then strategically decays the likelihood of critical tokens that contribute to repetitions. Extensive experimental results demonstrate that RPG substantially outperforms the best-performing baselines on the CodeRepetEval dataset.
arXiv Detail & Related papers (2025-05-15T15:26:32Z) - The Pitfalls of Memorization: When Memorization Hurts Generalization [28.5600484308805]
Memorization can reduce training loss to zero, leaving no incentive to learn robust, generalizable patterns. We propose memorization-aware training (MAT), which uses held-out predictions as a signal of memorization to shift a model's logits.
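As a hypothetical sketch of the logit-shift idea (the function name, interface, and the subtract-log-probability rule below are assumptions, not the paper's actual method), held-out predictions could down-weight labels the model reaches only via memorization:

```python
import numpy as np

def mat_shift_logits(logits: np.ndarray, held_out_probs: np.ndarray,
                     tau: float = 1.0) -> np.ndarray:
    """Hypothetical sketch of a memorization-aware logit shift.

    Subtracting tau-scaled log-probabilities from a held-out predictor
    penalizes labels that are easy to reach without generalizing, in the
    spirit of logit-adjustment training.
    """
    # Small epsilon guards against log(0) for labels the held-out
    # predictor assigns zero probability.
    return logits - tau * np.log(held_out_probs + 1e-12)
```

For example, two labels with equal raw logits end up ranked apart once one of them is heavily predicted by the held-out model.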
arXiv Detail & Related papers (2024-12-10T17:18:33Z) - On Memorization of Large Language Models in Logical Reasoning [70.94164038947078]
Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes.
One hypothesis is that the increasingly high and nearly saturated performance could be due to the memorization of similar problems.
We show that fine-tuning leads to heavy memorization, but it also consistently improves generalization performance.
arXiv Detail & Related papers (2024-10-30T15:31:54Z) - Unlocking Memorization in Large Language Models with Dynamic Soft Prompting [66.54460367290146]
Large language models (LLMs) have revolutionized natural language processing (NLP) tasks such as summarization, question answering, and translation.
LLMs pose significant security risks due to their tendency to memorize training data, leading to potential privacy breaches and copyright infringement.
We propose a novel method for estimating LLM memorization using dynamic, prefix-dependent soft prompts.
arXiv Detail & Related papers (2024-09-20T18:56:32Z) - Demystifying Verbatim Memorization in Large Language Models [67.49068128909349]
Large Language Models (LLMs) frequently memorize long sequences verbatim, often with serious legal and privacy implications.
We develop a framework to study verbatim memorization in a controlled setting by continuing pre-training from Pythia checkpoints with injected sequences.
We find that (1) non-trivial amounts of repetition are necessary for verbatim memorization to happen; (2) later (and presumably better) checkpoints are more likely to memorize verbatim sequences, even for out-of-distribution sequences.
arXiv Detail & Related papers (2024-07-25T07:10:31Z) - Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency. This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
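The correlation behind distributional memorization can be sketched directly. Below is a minimal, dependency-free Spearman rank correlation between per-example model probabilities and pretraining frequencies (the interface is hypothetical, and ties are ignored for simplicity):

```python
def distributional_memorization(model_probs, pretrain_freqs):
    """Sketch: Spearman rank correlation between model output probabilities
    and pretraining-data frequencies (hypothetical interface, no tie handling).
    """
    def ranks(xs):
        # Rank of each element by value (0 = smallest).
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(model_probs), ranks(pretrain_freqs)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    # Pearson correlation computed on the ranks = Spearman correlation.
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A value near +1 would indicate that the model's confidence tracks pretraining frequency, the signature of distributional memorization.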
arXiv Detail & Related papers (2024-07-20T21:24:40Z) - To Each (Textual Sequence) Its Own: Improving Memorized-Data Unlearning in Large Language Models [3.4990427823966828]
LLMs have been found to memorize training textual sequences and regurgitate verbatim said sequences during text generation time.
This fact is known to be the cause of privacy and related (e.g., copyright) problems.
Unlearning in LLMs then takes the form of devising new algorithms that will properly deal with these side-effects.
arXiv Detail & Related papers (2024-05-06T01:21:50Z) - Continual Referring Expression Comprehension via Dual Modular Memorization [133.46886428655426]
Referring Expression Comprehension (REC) aims to localize an image region of a given object described by a natural-language expression.
Existing REC algorithms make the strong assumption that all training data fed into a model are given upfront, which limits their practicality in real-world scenarios.
In this paper, we propose Continual Referring Expression Comprehension (CREC), a new setting for REC in which a model learns on a stream of incoming tasks.
In order to continuously improve the model on sequential tasks without forgetting prior knowledge and without repeatedly re-training from scratch, we propose an effective baseline method named Dual Modular Memorization.
arXiv Detail & Related papers (2023-11-25T02:58:51Z) - Exploring Memorization in Fine-tuned Language Models [53.52403444655213]
We conduct the first comprehensive analysis to explore language models' memorization during fine-tuning across tasks.
Our studies with open-source and our own fine-tuned LMs across various tasks indicate that memorization presents a strong disparity among different fine-tuning tasks.
We provide an intuitive explanation of this task disparity via sparse coding theory and unveil a strong correlation between memorization and attention score distribution.
arXiv Detail & Related papers (2023-10-10T15:41:26Z) - Mitigating the Learning Bias towards Repetition by Self-Contrastive Training for Open-Ended Generation [92.42032403795879]
We show that pretrained language models (LMs) such as GPT2 still tend to generate repetitive texts.
We attribute their overestimation of token-level repetition probabilities to the learning bias.
We find that LMs use longer-range dependencies to predict repetitive tokens than non-repetitive ones, which may be the cause of sentence-level repetition loops.
arXiv Detail & Related papers (2023-07-04T07:53:55Z) - Quantifying Memorization Across Neural Language Models [61.58529162310382]
Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized data verbatim.
This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others).
We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data.
arXiv Detail & Related papers (2022-02-15T18:48:31Z)
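The log-linear relationships themselves are not reproduced in this listing. As background, here is a minimal sketch of the kind of verbatim-extraction check this line of work builds on (the threshold `k` and the interface are assumptions for illustration):

```python
def is_extracted(model_continuation: str, training_suffix: str, k: int = 50) -> bool:
    """Sketch of a verbatim-extraction check (threshold k is an assumption).

    A sequence counts as memorized if the model's greedy continuation of a
    training prefix matches the true training suffix for at least k
    characters before diverging.
    """
    matched = 0
    for a, b in zip(model_continuation, training_suffix):
        if a != b:
            break
        matched += 1
    return matched >= k
```

Sweeping `k` while varying model size, duplication count, and prompt length is one way such studies trace how emission rates scale.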
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.