How Quantization Impacts Privacy Risk on LLMs for Code?
- URL: http://arxiv.org/abs/2508.00128v1
- Date: Thu, 31 Jul 2025 19:28:31 GMT
- Title: How Quantization Impacts Privacy Risk on LLMs for Code?
- Authors: Md Nazmul Haque, Hua Yang, Zhou Yang, Bowen Xu
- Abstract summary: We conduct the first empirical study on how quantization influences task performance and privacy risk simultaneously in LLMs4Code. Our results demonstrate that quantization significantly reduces the privacy risk relative to the original model. We also uncover a positive correlation between task performance and privacy risk, indicating an underlying tradeoff.
- Score: 8.607910400111853
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models for code (LLMs4Code) rely heavily on massive training data, including sensitive data, such as cloud service credentials of the projects and personally identifiable information of the developers, raising serious privacy concerns. Membership inference (MI) has recently emerged as an effective tool for assessing privacy risk by identifying whether specific data belong to a model's training set. In parallel, model compression techniques, especially quantization, have gained traction for reducing computational costs and enabling the deployment of large models. However, while quantized models still retain knowledge learned from the original training data, it remains unclear whether quantization affects their ability to retain and expose private information. Answering this question is of great importance to understanding privacy risks in real-world deployments. In this work, we conduct the first empirical study on how quantization influences task performance and privacy risk simultaneously in LLMs4Code. To do this, we apply widely used quantization techniques (static and dynamic) to three representative model families, namely Pythia, CodeGen, and GPTNeo. Our results demonstrate that quantization significantly reduces the privacy risk relative to the original model. We also uncover a positive correlation between task performance and privacy risk, indicating an underlying tradeoff. Moreover, we reveal the possibility that quantizing larger models could yield a better balance than using full-precision small models. Finally, we demonstrate that these findings generalize across different architectures, model sizes, and MI methods, offering practical guidance for safeguarding privacy when deploying compressed LLMs4Code.
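To make the setup concrete, below is a minimal sketch, assuming PyTorch and Hugging Face Transformers, of the two ingredients the study combines: post-training dynamic quantization of a causal LM and a simple loss-based membership-inference signal. The model name, sample snippet, and scoring rule are illustrative assumptions, not the authors' exact pipeline.

```python
# Illustrative sketch (assumption: not the paper's exact pipeline) of
# dynamic quantization plus a loss-based membership-inference (MI) signal.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-70m"  # assumed stand-in from the Pythia family

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

# Post-training dynamic quantization: Linear weights are stored as int8 and
# activations are quantized on the fly at inference time (CPU execution).
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

@torch.no_grad()
def mi_signal(m, text: str) -> float:
    """Per-sample language-modeling loss; lower loss is commonly taken as
    evidence that the sample was seen during training (membership)."""
    enc = tokenizer(text, return_tensors="pt")
    return m(**enc, labels=enc["input_ids"]).loss.item()

snippet = "def add(a, b):\n    return a + b\n"
print("full-precision loss:", mi_signal(model, snippet))
print("quantized loss:     ", mi_signal(quantized, snippet))
# If quantization erodes memorization, the member/non-member loss gap
# (estimated over many samples) should shrink for the quantized model.
```

A static-quantization variant would additionally require a calibration pass over representative data before converting the model.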
Related papers
- When Better Features Mean Greater Risks: The Performance-Privacy Trade-Off in Contrastive Learning [9.660010886245155]
  This paper systematically investigates the privacy threats posed by membership inference attacks (MIAs) targeting encoder models. We propose a novel membership inference attack method based on the p-norm of feature vectors, termed the Embedding Lp-Norm Likelihood Attack (LpLA).
  arXiv Detail & Related papers (2025-06-06T05:03:29Z)
- Multi-Objective Optimization-Based Anonymization of Structured Data for Machine Learning Application [0.5452584641316627]
  Various techniques have been proposed to address privacy concerns in data sharing. These methods often degrade data utility, impacting the performance of machine learning (ML) models. We propose a novel multi-objective optimization model that simultaneously minimizes information loss and maximizes protection against attacks.
  arXiv Detail & Related papers (2025-01-02T01:52:36Z)
- On the Privacy Risk of In-context Learning [36.633860818454984]
  We show that deploying prompted models presents a significant privacy risk for the data used within the prompt. We also observe that the privacy risk of prompted models exceeds that of fine-tuned models at the same utility levels.
  arXiv Detail & Related papers (2024-11-15T17:11:42Z)
- Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models [112.48136829374741]
  In this paper, we unveil a new vulnerability: the privacy backdoor attack. When a victim fine-tunes a backdoored model, their training data will be leaked at a significantly higher rate than if they had fine-tuned a typical model. Our findings highlight a critical privacy concern within the machine learning community and call for a reevaluation of safety protocols in the use of open-source pre-trained models.
  arXiv Detail & Related papers (2024-04-01T16:50:54Z)
- Initialization Matters: Privacy-Utility Analysis of Overparameterized Neural Networks [72.51255282371805]
  We prove a privacy bound for the KL divergence between model distributions on worst-case neighboring datasets. We find that this KL privacy bound is largely determined by the expected squared gradient norm relative to model parameters during training.
  arXiv Detail & Related papers (2023-10-31T16:13:22Z)
- Assessing Privacy Risks in Language Models: A Case Study on Summarization Tasks [65.21536453075275]
  We focus on the summarization task and investigate the membership inference (MI) attack. We exploit text similarity and the model's resistance to document modifications as potential MI signals. We discuss several safeguards for training summarization models to protect against MI attacks, as well as the inherent trade-off between privacy and utility.
  arXiv Detail & Related papers (2023-10-20T05:44:39Z)
- Privacy Preserving Large Language Models: ChatGPT Case Study Based Vision and Framework [6.828884629694705]
  This article proposes the conceptual model called PrivChatGPT, a privacy-generative model for LLMs. PrivChatGPT consists of two main components, i.e., preserving user privacy during the data curation/pre-processing together with preserving private context, and the private training process for large-scale data.
  arXiv Detail & Related papers (2023-10-19T06:55:13Z)
- PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
  We introduce Contextual Privacy Protection Language Models (PrivacyMind). Our work offers a theoretical analysis for model design and benchmarks various techniques. In particular, instruction tuning with both positive and negative examples stands out as a promising method.
  arXiv Detail & Related papers (2023-10-03T22:37:01Z)
- PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels [59.66777287810985]
  We introduce information-theoretic scores for privacy and utility, which quantify the average performance of an unfaithful user. We then theoretically characterize primitives in building families of encoding schemes that motivate the use of random deep neural networks.
  arXiv Detail & Related papers (2023-03-31T18:03:53Z)
- CANIFE: Crafting Canaries for Empirical Privacy Measurement in Federated Learning [77.27443885999404]
  Federated Learning (FL) is a setting for training machine learning models in distributed environments. We propose a novel method, CANIFE, that uses samples carefully crafted by a strong adversary to evaluate the empirical privacy of a training round.
  arXiv Detail & Related papers (2022-10-06T13:30:16Z)