Protecting Private Code in IDE Autocomplete using Differential Privacy
- URL: http://arxiv.org/abs/2601.22935v1
- Date: Fri, 30 Jan 2026 12:51:43 GMT
- Title: Protecting Private Code in IDE Autocomplete using Differential Privacy
- Authors: Evgeny Grigorenko, David Stanojević, David Ilić, Egor Bogomolov, Kostadin Cvejoski,
- Abstract summary: This paper investigates the use of Differential Privacy (DP) as a robust defense mechanism for training an Large Language Models (LLMs)<n>We fine-tune a ttexttMellum model using DP and conduct a comprehensive evaluation of its privacy and utility.<n>Our results demonstrate that DP provides a strong defense against Membership Inference Attacks (MIAs), reducing the attack's success rate close to a random guess (AUC from 0.901 to 0.606).
- Score: 4.963509029377068
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern Integrated Development Environments (IDEs) increasingly leverage Large Language Models (LLMs) to provide advanced features like code autocomplete. While powerful, training these models on user-written code introduces significant privacy risks, making the models themselves a new type of data vulnerability. Malicious actors can exploit this by launching attacks to reconstruct sensitive training data or infer whether a specific code snippet was used for training. This paper investigates the use of Differential Privacy (DP) as a robust defense mechanism for training an LLM for Kotlin code completion. We fine-tune a \texttt{Mellum} model using DP and conduct a comprehensive evaluation of its privacy and utility. Our results demonstrate that DP provides a strong defense against Membership Inference Attacks (MIAs), reducing the attack's success rate close to a random guess (AUC from 0.901 to 0.606). Furthermore, we show that this privacy guarantee comes at a minimal cost to model performance, with the DP-trained model achieving utility scores comparable to its non-private counterpart, even when trained on 100x less data. Our findings suggest that DP is a practical and effective solution for building private and trustworthy AI-powered IDE features.
Related papers
- Neighborhood Blending: A Lightweight Inference-Time Defense Against Membership Inference Attacks [5.468130838517792]
We introduce a novel inference-time defense mechanism called Neighborhood Blending.<n>Our method establishes a consistent confidence pattern, rendering members and non-members indistinguishable to an adversary.<n>It is a model-agnostic approach that offers a practical, lightweight solution that enhances privacy without sacrificing model utility.
arXiv Detail & Related papers (2026-02-13T14:01:21Z) - Towards Privacy-Preserving Code Generation: Differentially Private Code Language Models [2.4216414826638353]
This study systematically evaluates the effectiveness of Differential Privacy (DP) in CodeLLMs.<n>DP substantially reduces memorization in CodeLLMs across all the tested snippet types.<n>DP slightly increases perplexity but preserves, and can even enhance, the code generation capabilities of CodeLLMs.
arXiv Detail & Related papers (2025-12-12T11:31:13Z) - Machine Learning with Privacy for Protected Attributes [56.44253915927481]
We refine the definition of differential privacy (DP) to create a more general and flexible framework that we call feature differential privacy (FDP)<n>Our definition is simulation-based and allows for both addition/removal and replacement variants of privacy, and can handle arbitrary separation of protected and non-protected features.<n>We apply our framework to various machine learning tasks and show that it can significantly improve the utility of DP-trained models when public features are available.
arXiv Detail & Related papers (2025-06-24T17:53:28Z) - Mind the Privacy Unit! User-Level Differential Privacy for Language Model Fine-Tuning [62.224804688233]
differential privacy (DP) offers a promising solution by ensuring models are 'almost indistinguishable' with or without any particular privacy unit.
We study user-level DP motivated by applications where it necessary to ensure uniform privacy protection across users.
arXiv Detail & Related papers (2024-06-20T13:54:32Z) - Closed-Form Bounds for DP-SGD against Record-level Inference [18.85865832127335]
We focus on the popular DP-SGD algorithm, and derive simple closed-form bounds.
We obtain bounds for membership inference that match state-of-the-art techniques.
We present a novel data-dependent bound against attribute inference.
arXiv Detail & Related papers (2024-02-22T09:26:16Z) - Diffence: Fencing Membership Privacy With Diffusion Models [19.682186550192856]
Deep learning models are vulnerable to membership inference attacks (MIAs)<n>We introduce a novel defense framework against MIAs by leveraging generative models.<n>Our defense, called DIFFENCE, works pre inference, which is unlike prior defenses that are either training-time or post-inference time.
arXiv Detail & Related papers (2023-12-07T20:45:09Z) - When approximate design for fast homomorphic computation provides
differential privacy guarantees [0.08399688944263842]
Differential privacy (DP) and cryptographic primitives are popular countermeasures against privacy attacks.
In this paper, we design SHIELD, a probabilistic approximation algorithm for the argmax operator.
Even if SHIELD could have other applications, we here focus on one setting and seamlessly integrate it in the SPEED collaborative training framework.
arXiv Detail & Related papers (2023-04-06T09:38:01Z) - RelaxLoss: Defending Membership Inference Attacks without Losing Utility [68.48117818874155]
We propose a novel training framework based on a relaxed loss with a more achievable learning target.
RelaxLoss is applicable to any classification model with added benefits of easy implementation and negligible overhead.
Our approach consistently outperforms state-of-the-art defense mechanisms in terms of resilience against MIAs.
arXiv Detail & Related papers (2022-07-12T19:34:47Z) - Individual Privacy Accounting for Differentially Private Stochastic Gradient Descent [69.14164921515949]
We characterize privacy guarantees for individual examples when releasing models trained by DP-SGD.
We find that most examples enjoy stronger privacy guarantees than the worst-case bound.
This implies groups that are underserved in terms of model utility simultaneously experience weaker privacy guarantees.
arXiv Detail & Related papers (2022-06-06T13:49:37Z) - Large Scale Transfer Learning for Differentially Private Image
Classification [51.10365553035979]
Differential Privacy (DP) provides a formal framework for training machine learning models with individual example level privacy.
Private training using DP-SGD protects against leakage by injecting noise into individual example gradients.
While this result is quite appealing, the computational cost of training large-scale models with DP-SGD is substantially higher than non-private training.
arXiv Detail & Related papers (2022-05-06T01:22:20Z) - Just Fine-tune Twice: Selective Differential Privacy for Large Language
Models [69.66654761324702]
We propose a simple yet effective just-fine-tune-twice privacy mechanism to achieve SDP for large Transformer-based language models.
Experiments show that our models achieve strong performance while staying robust to the canary insertion attack.
arXiv Detail & Related papers (2022-04-15T22:36:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.