Provably Protecting Fine-Tuned LLMs from Training Data Extraction
- URL: http://arxiv.org/abs/2602.00688v1
- Date: Sat, 31 Jan 2026 12:18:36 GMT
- Title: Provably Protecting Fine-Tuned LLMs from Training Data Extraction
- Authors: Tom Segal, Asaf Shabtai, Yuval Elovici,
- Abstract summary: Fine-tuning large language models (LLMs) on sensitive datasets raises privacy concerns.<n>We propose SCP-$_r$, a Near Access Freeness (NAF)-based algorithm that operates on relative probabilities.<n> SCP-$_r$ achieves orders-of-magnitude better theoretical bounds than existing NAF based methods.
- Score: 27.190752375819972
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning large language models (LLMs) on sensitive datasets raises privacy concerns, as training data extraction (TDE) attacks can expose highly confidential information. Existing defenses against such attacks either lack formal privacy guarantees or incur substantial utility degradation. We observe that fine-tuning induces widespread probability shifts, yet preserving only a small subset of influential token-level deviations is sufficient; the remaining shifts can be aggressively smoothed with minimal impact on utility. Motivated by this insight, we propose SCP-$Δ_r$, a Near Access Freeness (NAF)-based algorithm that operates on relative probabilities and explicitly smooths low-impact tokens using a base model. SCP-$Δ_r$ achieves orders-of-magnitude better theoretical bounds than existing NAF based methods and provides strong empirical protection against TDE attacks with minimal performance loss.
Related papers
- LSHFed: Robust and Communication-Efficient Federated Learning with Locally-Sensitive Hashing Gradient Mapping [27.641729042448194]
Federated learning (FL) enables collaborative model training across distributed nodes without exposing raw data.<n>Inference attacks may recover sensitive information from gradient updates, while poisoning attacks can degrade model performance or induce malicious behaviors.<n>We propose LSHFed, a robust and communication-efficient FL framework that simultaneously enhances aggregation and privacy preservation.
arXiv Detail & Related papers (2025-11-03T07:28:14Z) - Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler [67.24175911858312]
Harmful fine-tuning poses critical safety risks to fine-tuning-as-a-service for large language models.<n>Bayesian Data Scheduler (BDS) is an adaptive tuning-stage defense strategy with no need for attack simulation.<n>BDS learns the posterior distribution of each data point's safety attribute, conditioned on the fine-tuning and alignment datasets.
arXiv Detail & Related papers (2025-10-31T04:49:37Z) - SOFT: Selective Data Obfuscation for Protecting LLM Fine-tuning against Membership Inference Attacks [17.77094760401298]
We study the vulnerability of fine-tuned large language models to membership inference attacks (MIAs)<n>We propose SOFT, a novel defense technique that mitigates privacy leakage by leveraging influential data selection with an adjustable parameter to balance utility preservation and privacy protection.
arXiv Detail & Related papers (2025-06-12T07:23:56Z) - Secure Distributed Learning for CAVs: Defending Against Gradient Leakage with Leveled Homomorphic Encryption [0.0]
Homomorphic Encryption (HE) offers a promising alternative to Differential Privacy (DP) and Secure Multi-Party Computation (SMPC)<n>We evaluate various HE schemes to identify the most suitable for Federated Learning (FL) in resource-constrained environments.<n>We develop a full HE-based FL pipeline that effectively mitigates Deep Leakage from Gradients (DLG) attacks while preserving model accuracy.
arXiv Detail & Related papers (2025-06-09T16:12:18Z) - No Query, No Access [50.18709429731724]
We introduce the textbfVictim Data-based Adrial Attack (VDBA), which operates using only victim texts.<n>To prevent access to the victim model, we create a shadow dataset with publicly available pre-trained models and clustering methods.<n>Experiments on the Emotion and SST5 datasets show that VDBA outperforms state-of-the-art methods, achieving an ASR improvement of 52.08%.
arXiv Detail & Related papers (2025-05-12T06:19:59Z) - Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data [73.04828796123581]
Supervised fine-tuning (SFT) has become a crucial step for aligning pretrained large language models (LLMs)<n>We introduce Discriminative Fine-Tuning (DFT), an improved variant of SFT, which mitigates the burden of collecting human-labeled preference data.<n>Our contributions include: (i) a discriminative probabilistic framework for fine-tuning LLMs by explicitly modeling the discriminative likelihood of an answer among all possible outputs given an input; (ii) efficient algorithms to optimize this discriminative likelihood; and (iii) extensive experiments demonstrating DFT's effectiveness
arXiv Detail & Related papers (2025-02-25T22:38:55Z) - PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning [8.61459170031022]
This paper introduces a novel security threat to FedPEFT, termed PEFT-as-an-Attack (PaaA)<n>Our evaluation of PaaA reveals that with less than 1% of the model's parameters set as trainable, and a small subset of clients acting maliciously, the attack achieves an approximate 80% attack success rate using representative PEFT methods such as LoRA.<n>Our results underscore the urgent need for more effective defense mechanisms that simultaneously ensure security and maintain the performance of the FedPEFT paradigm.
arXiv Detail & Related papers (2024-11-28T19:05:01Z) - Efficient and Private: Memorisation under differentially private parameter-efficient fine-tuning in language models [2.3281513013731145]
Fine-tuning large language models (LLMs) for specific tasks introduces privacy risks, as models may inadvertently memorise and leak sensitive training data.
Differential Privacy (DP) offers a solution to mitigate these risks, but introduces significant computational and performance trade-offs.
We show that PEFT methods achieve comparable performance to standard fine-tuning while requiring fewer parameters and significantly reducing privacy leakage.
arXiv Detail & Related papers (2024-11-24T13:17:36Z) - Impact of Dataset Properties on Membership Inference Vulnerability of Deep Transfer Learning [9.366691198320261]
Membership inference attacks (MIAs) are used to test practical privacy of machine learning models.<n>We show that the vulnerability of non-DP models when measured as the attacker advantage at a fixed false positive rate reduces according to a simple power law.
arXiv Detail & Related papers (2024-02-07T14:23:01Z) - Doubly Robust Instance-Reweighted Adversarial Training [107.40683655362285]
We propose a novel doubly-robust instance reweighted adversarial framework.
Our importance weights are obtained by optimizing the KL-divergence regularized loss function.
Our proposed approach outperforms related state-of-the-art baseline methods in terms of average robust performance.
arXiv Detail & Related papers (2023-08-01T06:16:18Z) - Learning, compression, and leakage: Minimising classification error via
meta-universal compression principles [87.054014983402]
A promising group of compression techniques for learning scenarios is normalised maximum likelihood (NML) coding.
Here we consider a NML-based decision strategy for supervised classification problems, and show that it attains PAC learning when applied to a wide variety of models.
We show that the misclassification rate of our method is upper bounded by the maximal leakage, a recently proposed metric to quantify the potential of data leakage in privacy-sensitive scenarios.
arXiv Detail & Related papers (2020-10-14T20:03:58Z) - Differentially Private Federated Learning with Laplacian Smoothing [72.85272874099644]
Federated learning aims to protect data privacy by collaboratively learning a model without sharing private data among users.
An adversary may still be able to infer the private training data by attacking the released model.
Differential privacy provides a statistical protection against such attacks at the price of significantly degrading the accuracy or utility of the trained models.
arXiv Detail & Related papers (2020-05-01T04:28:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.