LLM-based Privacy Data Augmentation Guided by Knowledge Distillation
with a Distribution Tutor for Medical Text Classification
- URL: http://arxiv.org/abs/2402.16515v1
- Date: Mon, 26 Feb 2024 11:52:55 GMT
- Title: LLM-based Privacy Data Augmentation Guided by Knowledge Distillation
with a Distribution Tutor for Medical Text Classification
- Authors: Yiping Song, Juhua Zhang, Zhiliang Tian, Yuxin Yang, Minlie Huang,
Dongsheng Li
- Abstract summary: We propose a DP-based tutor that models the noised private distribution and controls sample generation at a low privacy cost.
We theoretically analyze our model's privacy protection and empirically verify our model.
- Score: 67.92145284679623
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As sufficient data are not always publicly accessible for model training,
researchers exploit limited data with advanced learning algorithms or expand
the dataset via data augmentation (DA). Conducting DA in a private domain
requires privacy protection approaches (e.g., anonymization and perturbation),
but those methods cannot provide protection guarantees. Differential privacy
(DP) learning methods theoretically bound the protection but struggle to
generate pseudo text samples with large models. In this paper, we transfer the
DP-based pseudo sample generation task to a DP-based generated-sample
discrimination task, and propose a DP-based DA method with an LLM and a
DP-based discriminator for text classification on private domains. We construct
a knowledge distillation model as the DP-based discriminator: teacher models,
which access the private data, teach students how to select private samples
with calibrated noise to achieve DP. To constrain the distribution of the
augmented data, we propose a DP-based tutor that models the noised private
distribution and controls sample generation at a low privacy cost. We
theoretically analyze our model's privacy protection and empirically verify our
model.
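A rough reading of the mechanism: an ensemble of teachers, each trained on a disjoint shard of the private data, votes on LLM-generated candidates, and calibrated noise is added to the aggregated vote before a candidate enters the augmented set. The sketch below only illustrates that PATE-style noisy selection and is not the authors' code; the teacher interface, threshold, and noise scale are assumptions.

```python
import numpy as np

def dp_select(candidates, teachers, noise_scale=2.0, threshold=0.5):
    """Select LLM-generated pseudo samples via noisy teacher voting.

    candidates:  list of generated texts
    teachers:    classifiers trained on disjoint private shards, each exposing
                 .score(text) -> probability the text fits the private
                 distribution (assumed interface)
    noise_scale: std of Gaussian noise added to the vote count, calibrated to
                 the desired DP budget
    """
    selected = []
    for text in candidates:
        votes = np.array([t.score(text) > 0.5 for t in teachers], dtype=float)
        # Gaussian mechanism on the aggregated vote count
        noisy_fraction = (votes.sum() + np.random.normal(0.0, noise_scale)) / len(teachers)
        if noisy_fraction >= threshold:
            selected.append(text)
    return selected
```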
Related papers
- Beyond the Mean: Differentially Private Prototypes for Private Transfer Learning [16.028575596905554]
We propose Differentially Private Prototype Learning (DPPL) as a new paradigm for private transfer learning.
DPPL generates prototypes that represent each private class in the embedding space and can be publicly released for inference.
We show that privacy-utility trade-offs can be further improved when leveraging the public data beyond pre-training of the encoder.
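One common way to realize such publicly releasable prototypes is a noisy mean of clipped class embeddings (Gaussian mechanism). The sketch below illustrates that idea under assumed clipping and noise parameters; it is not necessarily the paper's exact construction.

```python
import numpy as np

def dp_prototype(embeddings, clip_norm=1.0, noise_std=0.5):
    """Differentially private class prototype: noisy mean of clipped embeddings.

    embeddings: (n, d) encoder outputs for one private class
    clip_norm:  per-example L2 clipping bound (bounds sensitivity)
    noise_std:  Gaussian noise multiplier, calibrated to the target (epsilon, delta)
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    clipped = embeddings * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    mean = clipped.mean(axis=0)
    noise = np.random.normal(0.0, noise_std * clip_norm / len(embeddings), size=mean.shape)
    return mean + noise  # can be released publicly; inference = nearest prototype
```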
arXiv Detail & Related papers (2024-06-12T09:41:12Z) - PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z) - Probing the Transition to Dataset-Level Privacy in ML Models Using an
Output-Specific and Data-Resolved Privacy Profile [23.05994842923702]
We study a privacy metric that quantifies the extent to which a model trained on a dataset using a Differential Privacy mechanism is "covered" by each of the distributions resulting from training on neighboring datasets.
We show that the privacy profile can be used to probe an observed transition to indistinguishability that takes place in the neighboring distributions as $\epsilon$ decreases.
arXiv Detail & Related papers (2023-06-27T20:39:07Z) - Arbitrary Decisions are a Hidden Cost of Differentially Private Training [7.560688419767116]
Mechanisms used in machine learning often aim to guarantee differential privacy (DP) during model training.
Practical DP-ensuring training methods use randomization when fitting model parameters to privacy-sensitive data.
For a given input example, the output predicted by equally-private models depends on the randomness used in training.
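A toy illustration of the point: DP-ensuring training of this kind (per-example clipping plus Gaussian noise) is randomized, so two equally private runs can disagree on the same test input. The logistic-regression sketch below is a hedged simplification, not the paper's experimental setup.

```python
import numpy as np

def dp_sgd_train(X, y, seed, clip=1.0, noise_std=1.0, lr=0.1, epochs=20):
    """Toy DP-SGD: per-example gradient clipping + Gaussian noise (logistic regression)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))
        grads = (preds - y)[:, None] * X                       # per-example gradients
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        grads = grads * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
        noisy = grads.sum(axis=0) + rng.normal(0.0, noise_std * clip, w.shape)
        w -= lr * noisy / len(X)
    return w

# Equally private runs can predict differently on the same point:
# w1, w2 = dp_sgd_train(X, y, seed=0), dp_sgd_train(X, y, seed=1)
# disagreement = (X_test @ w1 > 0) != (X_test @ w2 > 0)
```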
arXiv Detail & Related papers (2023-02-28T12:13:43Z) - A Prototype-Oriented Clustering for Domain Shift with Source Privacy [66.67700676888629]
We introduce Prototype-oriented Clustering with Distillation (PCD) to improve the performance and applicability of existing methods.
PCD first constructs a source clustering model by aligning the distributions of prototypes and data.
It then distills the knowledge to the target model through cluster labels provided by the source model while simultaneously clustering the target data.
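The distillation step can be pictured as pseudo-labeling: each target sample is assigned to its nearest source prototype, and those cluster labels supervise the target model while it clusters the target data. A minimal sketch of that assignment, with interfaces assumed rather than taken from the authors' implementation:

```python
import numpy as np

def pseudo_label_targets(target_feats, source_prototypes):
    """Assign each target sample to its nearest source prototype (cluster label).

    target_feats:      (n, d) target-domain features
    source_prototypes: (k, d) prototypes from the source clustering model;
                       only prototypes cross domains, so source data stays private
    """
    dists = np.linalg.norm(target_feats[:, None, :] - source_prototypes[None, :, :], axis=-1)
    return dists.argmin(axis=1)  # cluster labels used to distill into the target model
```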
arXiv Detail & Related papers (2023-02-08T00:15:35Z) - Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge.
Existing private generative models struggle with the utility of synthetic samples.
We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z) - Differentially Private Diffusion Models [46.46256537222917]
We build on the recent success of diffusion models (DMs) and introduce Differentially Private Diffusion Models (DPDMs)
We propose noise multiplicity, a powerful modification of DP-SGD tailored to the training of DMs.
We validate our novel DPDMs on image generation benchmarks and achieve state-of-the-art performance in all experiments.
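Noise multiplicity, as described, averages the diffusion loss over several independent noise/timestep draws for each example before the per-example gradient is clipped and noised, lowering gradient variance at no extra privacy cost. A hedged PyTorch-style sketch, where the diffusion loss function dm_loss and the timestep range are assumed placeholders:

```python
import torch

def noise_multiplicity_loss(model, x, dm_loss, K=4):
    """Average the diffusion training loss over K noise/timestep draws per example.

    model:   score/denoising network
    x:       a single private training example (per-example gradients are needed
             later for DP-SGD clipping and noising)
    dm_loss: function(model, x, t, eps) -> scalar diffusion loss (assumed)
    K:       noise multiplicity; larger K reduces gradient variance while the
             example is still used only once per DP-SGD step
    """
    losses = []
    for _ in range(K):
        t = torch.randint(0, 1000, (1,))       # random diffusion timestep
        eps = torch.randn_like(x)              # random diffusion noise
        losses.append(dm_loss(model, x, t, eps))
    # the gradient of this averaged loss is then clipped and noised as in DP-SGD
    return torch.stack(losses).mean()
```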
arXiv Detail & Related papers (2022-10-18T15:20:47Z) - An Ensemble Teacher-Student Learning Approach with Poisson Sub-sampling
to Differential Privacy Preserving Speech Recognition [51.20130423303659]
We propose an ensemble learning framework with Poisson sub-sampling to train a collection of teacher models that issue a differential privacy (DP) guarantee for the training data.
Through boosting under DP, a student model derived from the training data suffers little degradation compared with models trained without privacy protection.
Our proposed solution leverages two mechanisms: (i) privacy budget amplification via Poisson sub-sampling to train a target prediction model that requires less noise to achieve the same level of privacy budget, and (ii) a combination of the sub-sampling technique and an ensemble teacher-student learning framework.
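An illustrative sketch of those two pieces, assuming a classification-style noisy vote rather than the paper's speech recognition pipeline: Poisson sub-sampling builds each teacher's training set, and Laplace-noised teacher votes label the data used to train the student.

```python
import numpy as np

def poisson_subsample(records, q, rng):
    """Include each record independently with probability q (privacy amplification)."""
    mask = rng.random(len(records)) < q
    return [r for r, keep in zip(records, mask) if keep]

def noisy_teacher_labels(teacher_preds, num_classes, noise_scale, rng):
    """Aggregate per-sample teacher votes and add Laplace noise before labeling student data."""
    labels = []
    for votes in teacher_preds:                  # votes: one class prediction per teacher
        counts = np.bincount(votes, minlength=num_classes).astype(float)
        counts += rng.laplace(0.0, noise_scale, num_classes)
        labels.append(int(counts.argmax()))
    return labels
```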
arXiv Detail & Related papers (2022-10-12T16:34:08Z) - Personalized PATE: Differential Privacy for Machine Learning with
Individual Privacy Guarantees [1.2691047660244335]
We propose three novel methods to support training an ML model with different personalized privacy guarantees within the training data.
Our experiments show that our personalized privacy methods yield higher accuracy models than the non-personalized baseline.
arXiv Detail & Related papers (2022-02-21T20:16:27Z) - Don't Generate Me: Training Differentially Private Generative Models
with Sinkhorn Divergence [73.14373832423156]
We propose DP-Sinkhorn, a novel optimal transport-based generative method for learning data distributions from private data with differential privacy.
Unlike existing approaches for training differentially private generative models, we do not rely on adversarial objectives.
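The core of the method is an entropy-regularized optimal-transport (Sinkhorn) loss between generated and private batches, optimized without an adversarial critic; the privacy guarantee comes from DP noise on the generator's gradient updates. Below is a minimal sketch of the Sinkhorn cost itself (the DP-SGD step is only indicated in a comment), not the authors' implementation; batch sizes, the regularization strength, and iteration count are assumptions.

```python
import numpy as np

def sinkhorn_ot_cost(X_gen, X_priv, eps=0.1, iters=100):
    """Entropy-regularized OT cost between a generated batch and a private batch."""
    C = np.linalg.norm(X_gen[:, None, :] - X_priv[None, :, :], axis=-1) ** 2
    K = np.exp(-C / eps)
    a = np.full(len(X_gen), 1.0 / len(X_gen))
    b = np.full(len(X_priv), 1.0 / len(X_priv))
    u = np.ones_like(a)
    for _ in range(iters):                  # Sinkhorn fixed-point iterations
        v = b / (K.T @ u + 1e-12)
        u = a / (K @ v + 1e-12)
    P = u[:, None] * K * v[None, :]         # transport plan
    # In DP-Sinkhorn this cost backpropagates to the generator, whose per-example
    # gradients are clipped and noised (DP-SGD) to obtain the privacy guarantee.
    return (P * C).sum()
```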
arXiv Detail & Related papers (2021-11-01T18:10:21Z)