LLM-based Privacy Data Augmentation Guided by Knowledge Distillation
with a Distribution Tutor for Medical Text Classification
- URL: http://arxiv.org/abs/2402.16515v1
- Date: Mon, 26 Feb 2024 11:52:55 GMT
- Title: LLM-based Privacy Data Augmentation Guided by Knowledge Distillation
with a Distribution Tutor for Medical Text Classification
- Authors: Yiping Song, Juhua Zhang, Zhiliang Tian, Yuxin Yang, Minlie Huang,
Dongsheng Li
- Abstract summary: We propose a DP-based tutor that models the noised private distribution and controls sample generation at a low privacy cost.
We theoretically analyze our model's privacy protection and empirically verify our model.
- Score: 67.92145284679623
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As sufficient data are not always publicly accessible for model training,
researchers exploit limited data with advanced learning algorithms or expand
the dataset via data augmentation (DA). Conducting DA in a private domain
requires privacy protection approaches (e.g., anonymization and perturbation),
but those methods cannot provide protection guarantees. Differential privacy
(DP) learning methods theoretically bound the protection but struggle to
generate pseudo text samples with large models. In this paper, we transfer the
DP-based pseudo sample generation task to a DP-based generated-sample
discrimination task, and propose a DP-based DA method with an LLM and a
DP-based discriminator for text classification on private domains. We construct
a knowledge distillation model as the DP-based discriminator: teacher models,
which access the private data, teach students how to select private samples
with calibrated noise to achieve DP. To constrain the distribution of the
augmented data, we propose a DP-based tutor that models the noised private
distribution and controls sample generation at a low privacy cost. We
theoretically analyze our model's privacy protection and empirically verify our
model.
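A rough reading of the mechanism: an ensemble of teachers, each trained on a disjoint shard of the private data, votes on LLM-generated candidates, and calibrated noise is added to the aggregated vote before a candidate enters the augmented set. The sketch below only illustrates that PATE-style noisy selection and is not the authors' code; the teacher interface, threshold, and noise scale are assumptions.

```python
import numpy as np

def dp_select(candidates, teachers, noise_scale=2.0, threshold=0.5):
    """Select LLM-generated pseudo samples via noisy teacher voting.

    candidates:  list of generated texts
    teachers:    classifiers trained on disjoint private shards, each exposing
                 .score(text) -> probability the text fits the private
                 distribution (assumed interface)
    noise_scale: std of Gaussian noise added to the vote count, calibrated to
                 the desired DP budget
    """
    selected = []
    for text in candidates:
        votes = np.array([t.score(text) > 0.5 for t in teachers], dtype=float)
        # Gaussian mechanism on the aggregated vote count
        noisy_fraction = (votes.sum() + np.random.normal(0.0, noise_scale)) / len(teachers)
        if noisy_fraction >= threshold:
            selected.append(text)
    return selected
```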
Related papers
- Beyond the Mean: Differentially Private Prototypes for Private Transfer Learning [16.028575596905554]
We propose Differentially Private Prototype Learning (DPPL) as a new paradigm for private transfer learning.
DPPL generates prototypes that represent each private class in the embedding space and can be publicly released for inference.
We show that privacy-utility trade-offs can be further improved when leveraging the public data beyond pre-training of the encoder.
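One common way to realize such publicly releasable prototypes is a noisy mean of clipped class embeddings (Gaussian mechanism). The sketch below illustrates that idea under assumed clipping and noise parameters; it is not necessarily the paper's exact construction.

```python
import numpy as np

def dp_prototype(embeddings, clip_norm=1.0, noise_std=0.5):
    """Differentially private class prototype: noisy mean of clipped embeddings.

    embeddings: (n, d) encoder outputs for one private class
    clip_norm:  per-example L2 clipping bound (bounds sensitivity)
    noise_std:  Gaussian noise multiplier, calibrated to the target (epsilon, delta)
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    clipped = embeddings * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    mean = clipped.mean(axis=0)
    noise = np.random.normal(0.0, noise_std * clip_norm / len(embeddings), size=mean.shape)
    return mean + noise  # can be released publicly; inference = nearest prototype
```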
arXiv Detail & Related papers (2024-06-12T09:41:12Z) - PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z) - Probing the Transition to Dataset-Level Privacy in ML Models Using an
Output-Specific and Data-Resolved Privacy Profile [23.05994842923702]
We study a privacy metric that quantifies the extent to which a model trained on a dataset using a Differential Privacy mechanism is "covered" by each of the distributions resulting from training on neighboring datasets.
We show that the privacy profile can be used to probe an observed transition to indistinguishability that takes place in the neighboring distributions as $\epsilon$ decreases.
arXiv Detail & Related papers (2023-06-27T20:39:07Z) - Arbitrary Decisions are a Hidden Cost of Differentially Private Training [7.560688419767116]
Mechanisms used in machine learning often aim to guarantee differential privacy (DP) during model training.
Practical DP-ensuring training methods use randomization when fitting model parameters to privacy-sensitive data.
For a given input example, the output predicted by equally-private models depends on the randomness used in training.
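A toy illustration of the point: DP-ensuring training of this kind (per-example clipping plus Gaussian noise) is randomized, so two equally private runs can disagree on the same test input. The logistic-regression sketch below is a hedged simplification, not the paper's experimental setup.

```python
import numpy as np

def dp_sgd_train(X, y, seed, clip=1.0, noise_std=1.0, lr=0.1, epochs=20):
    """Toy DP-SGD: per-example gradient clipping + Gaussian noise (logistic regression)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))
        grads = (preds - y)[:, None] * X                       # per-example gradients
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        grads = grads * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
        noisy = grads.sum(axis=0) + rng.normal(0.0, noise_std * clip, w.shape)
        w -= lr * noisy / len(X)
    return w

# Equally private runs can predict differently on the same point:
# w1, w2 = dp_sgd_train(X, y, seed=0), dp_sgd_train(X, y, seed=1)
# disagreement = (X_test @ w1 > 0) != (X_test @ w2 > 0)
```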
arXiv Detail & Related papers (2023-02-28T12:13:43Z) - A Prototype-Oriented Clustering for Domain Shift with Source Privacy [66.67700676888629]
We introduce Prototype-oriented Clustering with Distillation (PCD) to improve the performance and applicability of existing methods.
PCD first constructs a source clustering model by aligning the distributions of prototypes and data.
It then distills the knowledge to the target model through cluster labels provided by the source model while simultaneously clustering the target data.
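The distillation step can be pictured as pseudo-labeling: each target sample is assigned to its nearest source prototype, and those cluster labels supervise the target model while it clusters the target data. A minimal sketch of that assignment, with interfaces assumed rather than taken from the authors' implementation:

```python
import numpy as np

def pseudo_label_targets(target_feats, source_prototypes):
    """Assign each target sample to its nearest source prototype (cluster label).

    target_feats:      (n, d) target-domain features
    source_prototypes: (k, d) prototypes from the source clustering model;
                       only prototypes cross domains, so source data stays private
    """
    dists = np.linalg.norm(target_feats[:, None, :] - source_prototypes[None, :, :], axis=-1)
    return dists.argmin(axis=1)  # cluster labels used to distill into the target model
```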
arXiv Detail & Related papers (2023-02-08T00:15:35Z) - Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge.
Existing private generative models struggle with the utility of synthetic samples.
We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z) - Differentially Private Diffusion Models [46.46256537222917]
We build on the recent success of diffusion models (DMs) and introduce Differentially Private Diffusion Models (DPDMs)
We propose noise multiplicity, a powerful modification of DP-SGD tailored to the training of DMs.
We validate our novel DPDMs on image generation benchmarks and achieve state-of-the-art performance in all experiments.
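Noise multiplicity, as described, averages the diffusion loss over several independent noise/timestep draws for each example before the per-example gradient is clipped and noised, lowering gradient variance at no extra privacy cost. A hedged PyTorch-style sketch, where the diffusion loss function dm_loss and the timestep range are assumed placeholders:

```python
import torch

def noise_multiplicity_loss(model, x, dm_loss, K=4):
    """Average the diffusion training loss over K noise/timestep draws per example.

    model:   score/denoising network
    x:       a single private training example (per-example gradients are needed
             later for DP-SGD clipping and noising)
    dm_loss: function(model, x, t, eps) -> scalar diffusion loss (assumed)
    K:       noise multiplicity; larger K reduces gradient variance while the
             example is still used only once per DP-SGD step
    """
    losses = []
    for _ in range(K):
        t = torch.randint(0, 1000, (1,))       # random diffusion timestep
        eps = torch.randn_like(x)              # random diffusion noise
        losses.append(dm_loss(model, x, t, eps))
    # the gradient of this averaged loss is then clipped and noised as in DP-SGD
    return torch.stack(losses).mean()
```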
arXiv Detail & Related papers (2022-10-18T15:20:47Z) - An Ensemble Teacher-Student Learning Approach with Poisson Sub-sampling
to Differential Privacy Preserving Speech Recognition [51.20130423303659]
We propose an ensemble learning framework with Poisson sub-sampling to train a collection of teacher models that issue a differential privacy (DP) guarantee for the training data.
Through boosting under DP, a student model derived from the training data suffers little degradation compared with models trained without privacy protection.
Our proposed solution leverages two mechanisms: (i) privacy budget amplification via Poisson sub-sampling to train a target prediction model that requires less noise to achieve the same level of privacy budget, and (ii) a combination of the sub-sampling technique and an ensemble teacher-student learning framework.
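An illustrative sketch of those two pieces, assuming a classification-style noisy vote rather than the paper's speech recognition pipeline: Poisson sub-sampling builds each teacher's training set, and Laplace-noised teacher votes label the data used to train the student.

```python
import numpy as np

def poisson_subsample(records, q, rng):
    """Include each record independently with probability q (privacy amplification)."""
    mask = rng.random(len(records)) < q
    return [r for r, keep in zip(records, mask) if keep]

def noisy_teacher_labels(teacher_preds, num_classes, noise_scale, rng):
    """Aggregate per-sample teacher votes and add Laplace noise before labeling student data."""
    labels = []
    for votes in teacher_preds:                  # votes: one class prediction per teacher
        counts = np.bincount(votes, minlength=num_classes).astype(float)
        counts += rng.laplace(0.0, noise_scale, num_classes)
        labels.append(int(counts.argmax()))
    return labels
```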
arXiv Detail & Related papers (2022-10-12T16:34:08Z) - Personalized PATE: Differential Privacy for Machine Learning with
Individual Privacy Guarantees [1.2691047660244335]
We propose three novel methods to support training an ML model with different personalized privacy guarantees within the training data.
Our experiments show that our personalized privacy methods yield higher accuracy models than the non-personalized baseline.
arXiv Detail & Related papers (2022-02-21T20:16:27Z) - Don't Generate Me: Training Differentially Private Generative Models
with Sinkhorn Divergence [73.14373832423156]
We propose DP-Sinkhorn, a novel optimal transport-based generative method for learning data distributions from private data with differential privacy.
Unlike existing approaches for training differentially private generative models, we do not rely on adversarial objectives.
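The core of the method is an entropy-regularized optimal-transport (Sinkhorn) loss between generated and private batches, optimized without an adversarial critic; the privacy guarantee comes from DP noise on the generator's gradient updates. Below is a minimal sketch of the Sinkhorn cost itself (the DP-SGD step is only indicated in a comment), not the authors' implementation; batch sizes, the regularization strength, and iteration count are assumptions.

```python
import numpy as np

def sinkhorn_ot_cost(X_gen, X_priv, eps=0.1, iters=100):
    """Entropy-regularized OT cost between a generated batch and a private batch."""
    C = np.linalg.norm(X_gen[:, None, :] - X_priv[None, :, :], axis=-1) ** 2
    K = np.exp(-C / eps)
    a = np.full(len(X_gen), 1.0 / len(X_gen))
    b = np.full(len(X_priv), 1.0 / len(X_priv))
    u = np.ones_like(a)
    for _ in range(iters):                  # Sinkhorn fixed-point iterations
        v = b / (K.T @ u + 1e-12)
        u = a / (K @ v + 1e-12)
    P = u[:, None] * K * v[None, :]         # transport plan
    # In DP-Sinkhorn this cost backpropagates to the generator, whose per-example
    # gradients are clipped and noised (DP-SGD) to obtain the privacy guarantee.
    return (P * C).sum()
```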
arXiv Detail & Related papers (2021-11-01T18:10:21Z)