Memorize or Generalize? Evaluating LLM Code Generation with Evolved Questions
- URL: http://arxiv.org/abs/2503.02296v1
- Date: Tue, 04 Mar 2025 05:39:24 GMT
- Title: Memorize or Generalize? Evaluating LLM Code Generation with Evolved Questions
- Authors: Wentao Chen, Lizhe Zhang, Li Zhong, Letian Peng, Zilong Wang, Jingbo Shang
- Abstract summary: Large Language Models (LLMs) are known to exhibit a memorization phenomenon in code generation. In this paper, we investigate this phenomenon by designing three evolution strategies to create variants: mutation, paraphrasing, and code-rewriting. As expected, as supervised fine-tuning goes on, the memorization score rises before overfitting, suggesting more severe memorization.
- Score: 33.58518352911762
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are known to exhibit a memorization phenomenon in code generation: instead of truly understanding the underlying principles of a programming problem, they tend to memorize the original prompt and its solution together during training. Consequently, when facing variants of the original problem, their answers are likely to resemble the memorized solutions and fail to generalize. In this paper, we investigate this phenomenon by designing three evolution strategies to create variants: mutation, paraphrasing, and code-rewriting. By comparing the performance and AST similarity of the LLM-generated code before and after these three evolutions, we develop a memorization score that positively correlates with the level of memorization. As expected, as supervised fine-tuning goes on, the memorization score rises before overfitting, suggesting more severe memorization. We demonstrate that common mitigation approaches, such as prompt translation and using evolved variants as data augmentation in supervised learning and reinforcement learning, either compromise the performance or fail to alleviate the memorization issue. Therefore, memorization remains a significant challenge in LLM code generation, highlighting the need for a more effective solution.
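The measurement described in the abstract compares generations before and after evolution via AST similarity and test performance. Below is a minimal sketch of how such a comparison could look in Python; it is an illustrative assumption, not the paper's implementation, and the function names, the node-type-overlap similarity, and the pass/fail combination rule are all hypothetical.

```python
import ast
from collections import Counter


def ast_node_counts(code: str) -> Counter:
    """Count AST node types appearing in a Python code string."""
    tree = ast.parse(code)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def ast_similarity(code_a: str, code_b: str) -> float:
    """Crude structural similarity in [0, 1]: overlap of AST node-type multisets."""
    a, b = ast_node_counts(code_a), ast_node_counts(code_b)
    shared = sum((a & b).values())
    total = max(sum(a.values()), sum(b.values()))
    return shared / total if total else 1.0


def memorization_signal(solution_on_original: str, solution_on_variant: str,
                        passed_original: bool, passed_variant: bool) -> float:
    """Hypothetical proxy (not the paper's formula): high structural similarity
    between the two generations combined with a pass-to-fail drop on the evolved
    variant is treated as evidence of memorization."""
    similarity = ast_similarity(solution_on_original, solution_on_variant)
    performance_drop = 1.0 if (passed_original and not passed_variant) else 0.0
    return similarity * performance_drop
```

Under this proxy, a model that emits near-identical code for a mutated problem while failing its tests would score close to 1.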
Related papers
- Memorization Sinks: Isolating Memorization during LLM Training [20.682505625638203]
Large language models are susceptible to memorizing repeated sequences, posing privacy and copyright concerns. We propose a new paradigm of MemSinks that promotes isolation of memorization by design. This is the first proof-of-concept on real data demonstrating that simultaneous generalization and isolation is achievable.
arXiv Detail & Related papers (2025-07-14T05:23:27Z)
- Rethinking Repetition Problems of LLMs in Code Generation [36.42947561896802]
We propose an efficient decoding approach called RPG, which stands for Repetition Penalization based on Grammar. RPG first leverages grammar rules to identify repetition problems during code generation, and then strategically decays the likelihood of critical tokens that contribute to repetitions. Extensive experimental results demonstrate that RPG substantially outperforms the best-performing baselines on the CodeRepetEval dataset.
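The mechanism described here, decaying the likelihood of tokens implicated in a repetition, can be pictured with a generic logit-penalty sketch. The grammar-based detection step is assumed to happen elsewhere, and the decay rule below is an assumption rather than RPG's actual formulation.

```python
import numpy as np


def decay_repetition_logits(logits: np.ndarray,
                            repetition_token_ids: set,
                            decay: float = 2.0) -> np.ndarray:
    """Generic sketch: down-weight the logits of tokens flagged (e.g. by a
    grammar-based analysis) as contributing to a repetition loop. The decay
    rule mirrors the classic repetition-penalty trick and is an assumption."""
    penalized = logits.copy()
    for tok in repetition_token_ids:
        penalized[tok] = logits[tok] / decay if logits[tok] > 0 else logits[tok] * decay
    return penalized
```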
arXiv Detail & Related papers (2025-05-15T15:26:32Z)
- The Pitfalls of Memorization: When Memorization Hurts Generalization [28.5600484308805]
Memorization can reduce training loss to zero, leaving no incentive to learn robust, generalizable patterns. We propose memorization-aware training (MAT), which uses held-out predictions as a signal of memorization to shift a model's logits.
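One generic way to picture "shifting logits using held-out predictions" is an additive combination of the trained model's logits with a held-out model's logits. The additive form, the scale alpha, and the helper names below are assumptions for illustration, not MAT's actual objective.

```python
import numpy as np


def shifted_logits(model_logits: np.ndarray,
                   heldout_logits: np.ndarray,
                   alpha: float = 1.0) -> np.ndarray:
    """Sketch: combine the trained model's logits with held-out-model logits so
    the training signal concentrates on what the held-out model does not already
    predict. The combination rule and alpha are assumptions."""
    return model_logits + alpha * heldout_logits
```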
arXiv Detail & Related papers (2024-12-10T17:18:33Z)
- On Memorization of Large Language Models in Logical Reasoning [70.94164038947078]
Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet can also make basic reasoning mistakes.
One hypothesis is that the increasingly high and nearly saturated performance could be due to the memorization of similar problems.
We show that fine-tuning leads to heavy memorization, but it also consistently improves generalization performance.
arXiv Detail & Related papers (2024-10-30T15:31:54Z)
- Unlocking Memorization in Large Language Models with Dynamic Soft Prompting [66.54460367290146]
Large language models (LLMs) have revolutionized natural language processing (NLP) tasks such as summarization, question answering, and translation.
LLMs pose significant security risks due to their tendency to memorize training data, leading to potential privacy breaches and copyright infringement.
We propose a novel method for estimating LLM memorization using dynamic, prefix-dependent soft prompts.
arXiv Detail & Related papers (2024-09-20T18:56:32Z)
- Demystifying Verbatim Memorization in Large Language Models [67.49068128909349]
Large Language Models (LLMs) frequently memorize long sequences verbatim, often with serious legal and privacy implications.
We develop a framework to study verbatim memorization in a controlled setting by continuing pre-training from Pythia checkpoints with injected sequences.
We find that (1) non-trivial amounts of repetition are necessary for verbatim memorization to happen; (2) later (and presumably better) checkpoints are more likely to memorize verbatim sequences, even for out-of-distribution sequences.
arXiv Detail & Related papers (2024-07-25T07:10:31Z)
- Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency. This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
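Taken literally, distributional memorization reduces to a correlation between model output probabilities and pretraining-data frequencies. Below is a minimal sketch of that computation under assumed inputs; the choice of Spearman rank correlation and the frequency estimates are assumptions, not taken from the paper.

```python
from scipy.stats import spearmanr


def distributional_memorization(output_probs: list,
                                pretraining_freqs: list) -> float:
    """Sketch: correlate the model's output probabilities for a set of items with
    how often those items occur in the pretraining corpus; a strong positive
    correlation is read as distributional memorization."""
    rho, _ = spearmanr(output_probs, pretraining_freqs)
    return float(rho)
```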
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
- To Each (Textual Sequence) Its Own: Improving Memorized-Data Unlearning in Large Language Models [3.4990427823966828]
LLMs have been found to memorize textual training sequences and regurgitate those sequences verbatim at generation time.
This is a known cause of privacy and related (e.g., copyright) problems.
Unlearning in LLMs therefore takes the form of devising new algorithms that properly address these side effects.
arXiv Detail & Related papers (2024-05-06T01:21:50Z)
- Exploring Memorization in Fine-tuned Language Models [53.52403444655213]
We conduct the first comprehensive analysis to explore language models' memorization during fine-tuning across tasks.
Our studies with open-source and our own fine-tuned LMs across various tasks indicate that the degree of memorization varies strongly across different fine-tuning tasks.
We provide an intuitive explanation of this task disparity via sparse coding theory and unveil a strong correlation between memorization and attention score distribution.
arXiv Detail & Related papers (2023-10-10T15:41:26Z)
- Quantifying Memorization Across Neural Language Models [61.58529162310382]
Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized data verbatim.
This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others).
We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data.
arXiv Detail & Related papers (2022-02-15T18:48:31Z)