Towards Privacy-Preserving Code Generation: Differentially Private Code Language Models
- URL: http://arxiv.org/abs/2512.11482v1
- Date: Fri, 12 Dec 2025 11:31:13 GMT
- Title: Towards Privacy-Preserving Code Generation: Differentially Private Code Language Models
- Authors: Melih Catal, Pooja Rani, Harald C. Gall
- Abstract summary: This study systematically evaluates the effectiveness of Differential Privacy (DP) in CodeLLMs. DP substantially reduces memorization in CodeLLMs across all the tested snippet types. DP slightly increases perplexity but preserves, and can even enhance, the code generation capabilities of CodeLLMs.
- Score: 2.4216414826638353
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models specialized for code (CodeLLMs) have demonstrated remarkable capabilities in generating code snippets, documentation, and test cases. However, despite their promising capabilities, CodeLLMs can inadvertently memorize and reproduce snippets from their training data, which poses risks of privacy breaches and intellectual property violations. These risks restrict the deployment of CodeLLMs in sensitive domains and limit their training datasets to publicly available sources. To mitigate the memorization risk without compromising their task performance, we apply Differential Privacy (DP) to CodeLLMs. To the best of our knowledge, this is the first comprehensive study that systematically evaluates the effectiveness of DP in CodeLLMs. DP adds calibrated noise to the training process to protect individual data points while still allowing the model to learn useful patterns. To this end, we first identify and understand the driving reasons of the memorization behaviour of the CodeLLMs during their fine-tuning. Then, to address this issue, we empirically evaluate the effect of DP on mitigating memorization while preserving code generation capabilities. Our findings show that DP substantially reduces memorization in CodeLLMs across all the tested snippet types. The snippet types most prone to memorization are also the most effectively mitigated by DP. Furthermore, we observe that DP slightly increases perplexity but preserves, and can even enhance, the code generation capabilities of CodeLLMs, which makes it feasible to apply DP in practice without significantly compromising model utility. Finally, we analyze the impact of DP on training efficiency and energy consumption, finding that DP does not significantly affect training time or energy usage, making it a practical choice for privacy-preserving CodeLLMs training.
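The abstract describes DP as adding calibrated noise to the training process. A minimal sketch of what such a step can look like is given below, assuming the common DP-SGD recipe (clip each example's gradient, sum, add Gaussian noise, average). The abstract does not state which mechanism or hyperparameters the authors use, so the function names `clip_gradient` and `dp_sgd_step` and the parameters `clip_norm` and `noise_multiplier` are illustrative assumptions, not the paper's implementation.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Rescale one example's gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    # Gradients already inside the clipping ball are left unchanged (scale = 1).
    scale = min(1.0, clip_norm / max(norm, 1e-12))
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD style update (assumed recipe, not taken from the paper).

    per_example_grads holds one gradient vector per training example.
    """
    batch_size = len(per_example_grads)
    # 1. Bound each example's influence by clipping its gradient.
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    # 2. Sum the clipped gradients coordinate by coordinate.
    summed = [sum(column) for column in zip(*clipped)]
    # 3. Add Gaussian noise whose scale is calibrated to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    # 4. Average over the batch and take a gradient descent step.
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]


# Hypothetical usage with two toy per-example gradients.
rng = random.Random(0)
new_params = dp_sgd_step(
    params=[0.0, 0.0],
    per_example_grads=[[3.0, 4.0], [0.1, -0.2]],
    lr=0.5,
    clip_norm=1.0,
    noise_multiplier=1.1,
    rng=rng,
)
print(new_params)
```

Clipping bounds how much any single training example can move the parameters, and the noise scale is tied to that bound. In this recipe, a larger `noise_multiplier` gives stronger privacy at the cost of a noisier update.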
Related papers
- Protecting Private Code in IDE Autocomplete using Differential Privacy [4.963509029377068]
This paper investigates the use of Differential Privacy (DP) as a robust defense mechanism for training Large Language Models (LLMs). We fine-tune a Mellum model using DP and conduct a comprehensive evaluation of its privacy and utility. Our results demonstrate that DP provides a strong defense against Membership Inference Attacks (MIAs), reducing the attack's success rate close to a random guess (AUC from 0.901 to 0.606).
arXiv Detail & Related papers (2026-01-30T12:51:43Z)
- Scrub It Out! Erasing Sensitive Memorization in Code Language Models via Machine Unlearning [50.45435841411193]
Code Language Models (CLMs) exhibit unintended memorization of sensitive training data, enabling verbatim reproduction of confidential information when specifically prompted. We introduce CodeEraser, an advanced variant that selectively unlearns sensitive memorized segments in code while preserving the structural integrity and functional correctness of the surrounding code.
arXiv Detail & Related papers (2025-09-17T07:12:35Z)
- Memorize or Generalize? Evaluating LLM Code Generation with Code Rewriting [54.48306552577881]
We argue that large language models (LLMs) are mostly doing memorization (i.e., replicating or reusing large parts of their training data) versus generalization. Existing evaluations largely proxy neglecting surface/structural similarity, thereby conflating benign reuse of repeated code with harmful recall and memorization task correctness. We propose Memorization Risk Index (MRI), a normalized score that combines two signals: (i) how similar the model's answer for the rewritten task is to the original ground-truth solution, and (ii) how much performance drops from the original task to its rewritten counterpart.
arXiv Detail & Related papers (2025-03-04T05:39:24Z)
- Pre-training Differentially Private Models with Limited Public Data [54.943023722114134]
Differential privacy (DP) is a prominent method to gauge the degree of security provided to the models.
DP is not yet capable of protecting a substantial portion of the data used during the initial pre-training stage.
We develop a novel DP continual pre-training strategy using only 10% of public data.
Our strategy can achieve DP accuracy of 41.5% on ImageNet-21k, as well as non-DP accuracy of 55.7% and 60.0% on downstream tasks Places365 and iNaturalist-2021.
arXiv Detail & Related papers (2024-02-28T23:26:27Z)
- DPZero: Private Fine-Tuning of Language Models without Backpropagation [49.365749361283704]
We introduce DPZero, a novel private zeroth-order algorithm with nearly dimension-independent rates.
The memory efficiency of DPZero is demonstrated in privately fine-tuning RoBERTa and OPT on several downstream tasks.
arXiv Detail & Related papers (2023-10-14T18:42:56Z)
- Selectivity Drives Productivity: Efficient Dataset Pruning for Enhanced Transfer Learning [66.20311762506702]
Dataset pruning (DP) has emerged as an effective way to improve data efficiency.
We propose two new DP methods, label mapping and feature mapping, for supervised and self-supervised pretraining settings.
We show that source data classes can be pruned by 40% to 80% without sacrificing downstream performance.
arXiv Detail & Related papers (2023-10-13T00:07:49Z)
- PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
- When approximate design for fast homomorphic computation provides differential privacy guarantees [0.08399688944263842]
Differential privacy (DP) and cryptographic primitives are popular countermeasures against privacy attacks.
In this paper, we design SHIELD, a probabilistic approximation algorithm for the argmax operator.
Although SHIELD could have other applications, we focus here on one setting and seamlessly integrate it in the SPEED collaborative training framework.
arXiv Detail & Related papers (2023-04-06T09:38:01Z)
- How to DP-fy ML: A Practical Guide to Machine Learning with Differential Privacy [22.906644117887133]
Differential Privacy (DP) has become a gold standard for making formal statements about data anonymization.
The adoption of DP is hindered by limited practical guidance of what DP protection entails, what privacy guarantees to aim for, and the difficulty of achieving good privacy-utility-computation trade-offs for ML models.
This work is a self-contained guide that gives an in-depth overview of the field of DP ML and presents information about achieving the best possible DP ML model with rigorous privacy guarantees.
arXiv Detail & Related papers (2023-03-01T16:56:39Z)
- Lifelong DP: Consistently Bounded Differential Privacy in Lifelong Machine Learning [28.68587691924582]
We show that the process of continually learning new tasks and memorizing previous tasks introduces unknown privacy risks and makes it challenging to bound the privacy loss.
We introduce a formal definition of Lifelong DP, in which the participation of any data in the training set of any task is protected.
We propose a scalable and heterogeneous algorithm, called L2DP-ML, to efficiently train and continue releasing new versions of an L2M model.
arXiv Detail & Related papers (2022-07-26T11:55:21Z)