Training Data Protection with Compositional Diffusion Models
- URL: http://arxiv.org/abs/2308.01937v4
- Date: Sun, 13 Oct 2024 22:32:43 GMT
- Title: Training Data Protection with Compositional Diffusion Models
- Authors: Aditya Golatkar, Alessandro Achille, Ashwin Swaminathan, Stefano Soatto
- Abstract summary: Compartmentalized Diffusion Models (CDM) are a method to train different diffusion models (or prompts) on distinct data sources.
Individual models can be trained in isolation, at different times, and on different distributions and domains.
Each model contains information only about the subset of the data it was exposed to during training, enabling several forms of training data protection.
- Score: 99.46239561159953
- Abstract: We introduce Compartmentalized Diffusion Models (CDM), a method to train different diffusion models (or prompts) on distinct data sources and arbitrarily compose them at inference time. The individual models can be trained in isolation, at different times, and on different distributions and domains, and can later be composed to achieve performance comparable to a paragon model trained on all data simultaneously. Furthermore, each model contains information only about the subset of the data it was exposed to during training, enabling several forms of training data protection. In particular, CDMs enable perfect selective forgetting and continual learning for large-scale diffusion models, and allow serving customized models based on the user's access rights. Empirically, the quality (FID) of the class-conditional CDMs (8 splits) is within 10% (on fine-grained vision datasets) of a monolithic model (no splits), and CDMs allow 8x faster forgetting compared to a monolithic model, with a maximum FID increase of 1%. When applied to text-to-image generation, CDMs improve alignment (TIFA) by 14.33% over a monolithic model trained on MSCOCO. CDMs also allow determining the importance of a subset of the data (attribution) in generating particular samples, and reduce memorization.
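The abstract states that separately trained models are composed at inference time, but this page does not give the composition rule. The following is a minimal sketch under a stated assumption: the target distribution is treated as a mixture of the per-source distributions, in which case the score of the mixture is a weighted sum of the per-source scores, with weights equal to the posterior probability of each source given the sample. Isotropic Gaussians stand in for trained diffusion models so that the densities are available in closed form. All function names and parameters (`composed_score`, `active`, and so on) are hypothetical and are not taken from the paper. In a real diffusion model the weights would have to be estimated, which this sketch does not attempt.

```python
import numpy as np


def gaussian_logpdf(x, mean, var):
    """Log-density of an isotropic Gaussian N(mean, var * I) at point x."""
    d = x.shape[-1]
    diff = x - mean
    return -0.5 * (d * np.log(2 * np.pi * var) + np.sum(diff ** 2, axis=-1) / var)


def gaussian_score(x, mean, var):
    """Score (gradient of the log-density w.r.t. x) of N(mean, var * I)."""
    return (mean - x) / var


def composed_score(x, means, variances, priors, active=None):
    """Score of the mixture over the `active` components (toy stand-in for a CDM).

    Assumption: each component plays the role of one model trained on one
    data source. Leaving a component out of `active` removes its
    contribution entirely and renormalizes the remaining weights.
    """
    active = list(range(len(means))) if active is None else list(active)
    # Unnormalized log-posterior of each source given x.
    log_terms = np.array(
        [np.log(priors[i]) + gaussian_logpdf(x, means[i], variances[i]) for i in active]
    )
    log_terms -= log_terms.max()  # subtract the max for numerical stability
    weights = np.exp(log_terms)
    weights /= weights.sum()
    scores = np.stack([gaussian_score(x, means[i], variances[i]) for i in active])
    return weights @ scores, weights


# Two hypothetical data sources and one query point.
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
variances = [1.0, 0.5]
priors = [0.3, 0.7]
x = np.array([0.5, -0.2])

full_score, full_weights = composed_score(x, means, variances, priors)
# "Forgetting" source 1 in this toy setting: compose without it.
kept_score, kept_weights = composed_score(x, means, variances, priors, active=[0])
```

In this toy setting, the per-sample weights also give a simple notion of attribution (how much each source contributes to the score at `x`), and serving a user with restricted access rights corresponds to passing a restricted `active` list.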
Related papers
- Exploring Federated Deep Learning for Standardising Naming Conventions in Radiotherapy Data [0.18749305679160366]
Standardising structure volume names in radiotherapy (RT) data is necessary to enable data mining and analyses.
No studies have considered that RT patient records are distributed across multiple data centres.
This paper introduces a method that emulates real-world environments to establish standardised nomenclature.
A multimodal deep artificial neural network was proposed to standardise RT data in federated settings.
arXiv Detail & Related papers (2024-02-14T07:52:28Z)
- The Journey, Not the Destination: How Data Guides Diffusion Models [75.19694584942623]
Diffusion models trained on large datasets can synthesize photo-realistic images of remarkable quality and diversity.
We propose a framework that: (i) provides a formal notion of data attribution in the context of diffusion models, and (ii) allows us to counterfactually validate such attributions.
arXiv Detail & Related papers (2023-12-11T08:39:43Z)
- On Memorization in Diffusion Models [46.656797890144105]
We show that memorization behaviors tend to occur on smaller-sized datasets.
We quantify the impact of the influential factors on these memorization behaviors in terms of effective model memorization (EMM).
Our study holds practical significance for diffusion model users and offers clues to theoretical research in deep generative models.
arXiv Detail & Related papers (2023-10-04T09:04:20Z)
- Distributional Inclusion Hypothesis and Quantifications: Probing for Hypernymy in Functional Distributional Semantics [50.363809539842386]
Functional Distributional Semantics (FDS) models the meaning of words by truth-conditional functions.
We show that FDS models learn hypernymy on a restricted class of corpora that strictly follow the Distributional Inclusion Hypothesis (DIH).
arXiv Detail & Related papers (2023-09-15T11:28:52Z)
- Phoenix: A Federated Generative Diffusion Model [6.09170287691728]
Training generative models on large centralized datasets can pose challenges in terms of data privacy, security, and accessibility.
This paper proposes a novel method for training a Denoising Diffusion Probabilistic Model (DDPM) across multiple data sources using Federated Learning (FL) techniques.
arXiv Detail & Related papers (2023-06-07T01:43:09Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Few-Shot Diffusion Models [15.828257653106537]
We present Few-Shot Diffusion Models (FSDM), a framework for few-shot generation leveraging conditional DDPMs.
FSDM is trained to adapt the generative process conditioned on a small set of images from a given class by aggregating image patch information.
We empirically show that FSDM can perform few-shot generation and transfer to new datasets.
arXiv Detail & Related papers (2022-05-30T23:20:33Z)
- Ensemble Distillation for Robust Model Fusion in Federated Learning [72.61259487233214]
Federated Learning (FL) is a machine learning setting where many devices collaboratively train a machine learning model.
In most current training schemes, the central model is refined by averaging the parameters of the server model and the updated parameters from the client side.
We propose ensemble distillation for model fusion, i.e. training the central classifier through unlabeled data on the outputs of the models from the clients.
arXiv Detail & Related papers (2020-06-12T14:49:47Z)