An Efficient Framework for Crediting Data Contributors of Diffusion Models
- URL: http://arxiv.org/abs/2407.03153v2
- Date: Wed, 22 Jan 2025 18:21:13 GMT
- Title: An Efficient Framework for Crediting Data Contributors of Diffusion Models
- Authors: Chris Lin, Mingyu Lu, Chanwoo Kim, Su-In Lee,
- Abstract summary: We introduce a method to efficiently retrain and rerun inference for Shapley value estimation.
We evaluate the utility of our method with three use cases: (i) image quality for a DDPM trained on a CIFAR dataset, (ii) demographic diversity for an LDM trained on CelebA-HQ, and (iii) aesthetic quality for a Stable Diffusion model LoRA-finetuned on Post-Impressionist artworks.
- Score: 13.761241561734547
- License:
- Abstract: As diffusion models are deployed in real-world settings, and their performance is driven by training data, appraising the contribution of data contributors is crucial to creating incentives for sharing quality data and to implementing policies for data compensation. Depending on the use case, model performance corresponds to various global properties of the distribution learned by a diffusion model (e.g., overall aesthetic quality). Hence, here we address the problem of attributing global properties of diffusion models to data contributors. The Shapley value provides a principled approach to valuation by uniquely satisfying game-theoretic axioms of fairness. However, estimating Shapley values for diffusion models is computationally impractical because it requires retraining on many training data subsets corresponding to different contributors and rerunning inference. We introduce a method to efficiently retrain and rerun inference for Shapley value estimation, by leveraging model pruning and fine-tuning. We evaluate the utility of our method with three use cases: (i) image quality for a DDPM trained on a CIFAR dataset, (ii) demographic diversity for an LDM trained on CelebA-HQ, and (iii) aesthetic quality for a Stable Diffusion model LoRA-finetuned on Post-Impressionist artworks. Our results empirically demonstrate that our framework can identify important data contributors across models' global properties, outperforming existing attribution methods for diffusion models.
Related papers
- Influence Functions for Scalable Data Attribution in Diffusion Models [52.92223039302037]
Diffusion models have led to significant advancements in generative modelling.
Yet their widespread adoption poses challenges regarding data attribution and interpretability.
We develop an influence functions framework to address these challenges.
arXiv Detail & Related papers (2024-10-17T17:59:02Z) - MITA: Bridging the Gap between Model and Data for Test-time Adaptation [68.62509948690698]
Test-Time Adaptation (TTA) has emerged as a promising paradigm for enhancing the generalizability of models.
We propose Meet-In-The-Middle based MITA, which introduces energy-based optimization to encourage mutual adaptation of the model and data from opposing directions.
arXiv Detail & Related papers (2024-10-12T07:02:33Z) - Model-Based Diffusion for Trajectory Optimization [8.943418808959494]
We introduce Model-Based Diffusion (MBD), an optimization approach using the diffusion process to solve trajectory optimization (TO) problems without data.
Although MBD does not require external data, it can be naturally integrated with data of diverse qualities to steer the diffusion process.
MBD outperforms state-of-the-art reinforcement learning and sampling-based TO methods in challenging contact-rich tasks.
arXiv Detail & Related papers (2024-05-28T22:14:25Z) - Transfer Learning for Diffusion Models [43.10840361752551]
Diffusion models consistently produce high-quality synthetic samples.
They can be impractical in real-world applications due to high collection costs or associated risks.
This paper introduces the Transfer Guided Diffusion Process (TGDP), a novel approach distinct from conventional finetuning and regularization methods.
arXiv Detail & Related papers (2024-05-27T06:48:58Z) - The Journey, Not the Destination: How Data Guides Diffusion Models [75.19694584942623]
Diffusion models trained on large datasets can synthesize photo-realistic images of remarkable quality and diversity.
We propose a framework that: (i) provides a formal notion of data attribution in the context of diffusion models, and (ii) allows us to counterfactually validate such attributions.
arXiv Detail & Related papers (2023-12-11T08:39:43Z) - Leveraging Foundation Models to Improve Lightweight Clients in Federated
Learning [16.684749528240587]
Federated Learning (FL) is a distributed training paradigm that enables clients scattered across the world to cooperatively learn a global model without divulging confidential data.
FL faces a significant challenge in the form of heterogeneous data distributions among clients, which leads to a reduction in performance and robustness.
We introduce foundation model distillation to assist in the federated training of lightweight client models and increase their performance under heterogeneous data settings while keeping inference costs low.
arXiv Detail & Related papers (2023-11-14T19:10:56Z) - Intriguing Properties of Data Attribution on Diffusion Models [33.77847454043439]
Data attribution seeks to trace desired outputs back to training data.
Data attribution has become a module to properly assign for high-intuitive or copyrighted data.
arXiv Detail & Related papers (2023-11-01T13:00:46Z) - Consistency Regularization for Generalizable Source-free Domain
Adaptation [62.654883736925456]
Source-free domain adaptation (SFDA) aims to adapt a well-trained source model to an unlabelled target domain without accessing the source dataset.
Existing SFDA methods ONLY assess their adapted models on the target training set, neglecting the data from unseen but identically distributed testing sets.
We propose a consistency regularization framework to develop a more generalizable SFDA method.
arXiv Detail & Related papers (2023-08-03T07:45:53Z) - Training Data Attribution for Diffusion Models [1.1733780065300188]
We propose a novel solution that reveals how training data influence the output of diffusion models through the use of ensembles.
In our approach individual models in an encoded ensemble are trained on carefully engineered splits of the overall training data to permit the identification of influential training examples.
The resulting model ensembles enable efficient ablation of training data influence, allowing us to assess the impact of training data on model outputs.
arXiv Detail & Related papers (2023-06-03T18:36:12Z) - FairIF: Boosting Fairness in Deep Learning via Influence Functions with
Validation Set Sensitive Attributes [51.02407217197623]
We propose a two-stage training algorithm named FAIRIF.
It minimizes the loss over the reweighted data set where the sample weights are computed.
We show that FAIRIF yields models with better fairness-utility trade-offs against various types of bias.
arXiv Detail & Related papers (2022-01-15T05:14:48Z) - Mind the Trade-off: Debiasing NLU Models without Degrading the
In-distribution Performance [70.31427277842239]
We introduce a novel debiasing method called confidence regularization.
It discourages models from exploiting biases while enabling them to receive enough incentive to learn from all the training examples.
We evaluate our method on three NLU tasks and show that, in contrast to its predecessors, it improves the performance on out-of-distribution datasets.
arXiv Detail & Related papers (2020-05-01T11:22:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.