Related papers: Nonparametric Data Attribution for Diffusion Models

Nonparametric Data Attribution for Diffusion Models

URL: http://arxiv.org/abs/2510.14269v1
Date: Thu, 16 Oct 2025 03:37:16 GMT
Title: Nonparametric Data Attribution for Diffusion Models
Authors: Yutian Zhao, Chao Du, Xiaosen Zheng, Tianyu Pang, Min Lin,
Abstract summary: Data attribution for generative models seeks to quantify the influence of individual training examples on model outputs.<n>We propose a nonparametric attribution method that operates entirely on data, measuring influence via patch-level similarity between generated and training images.
Score: 57.820618036556084
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Data attribution for generative models seeks to quantify the influence of individual training examples on model outputs. Existing methods for diffusion models typically require access to model gradients or retraining, limiting their applicability in proprietary or large-scale settings. We propose a nonparametric attribution method that operates entirely on data, measuring influence via patch-level similarity between generated and training images. Our approach is grounded in the analytical form of the optimal score function and naturally extends to multiscale representations, while remaining computationally efficient through convolution-based acceleration. In addition to producing spatially interpretable attributions, our framework uncovers patterns that reflect intrinsic relationships between training data and outputs, independent of any specific model. Experiments demonstrate that our method achieves strong attribution performance, closely matching gradient-based approaches and substantially outperforming existing nonparametric baselines. Code is available at https://github.com/sail-sg/NDA.

Related papers

Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning [77.120955854093]
We show that data diversity can be a strong predictor of generalization in language models.<n>We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients.<n>We present Prismatic Synthesis, a framework for generating diverse synthetic data.
arXiv Detail & Related papers (2025-05-26T16:05:10Z)
Influence Functions for Scalable Data Attribution in Diffusion Models [52.92223039302037]
Diffusion models have led to significant advancements in generative modelling.<n>Yet their widespread adoption poses challenges regarding data attribution and interpretability.<n>We develop an influence functions framework to address these challenges.
arXiv Detail & Related papers (2024-10-17T17:59:02Z)
Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop. We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models. We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
Intriguing Properties of Data Attribution on Diffusion Models [33.77847454043439]
Data attribution seeks to trace desired outputs back to training data. Data attribution has become a module to properly assign for high-intuitive or copyrighted data.
arXiv Detail & Related papers (2023-11-01T13:00:46Z)
MissDiff: Training Diffusion Models on Tabular Data with Missing Values [29.894691645801597]
This work presents a unified and principled diffusion-based framework for learning from data with missing values. We first observe that the widely adopted "impute-then-generate" pipeline may lead to a biased learning objective. We prove the proposed method is consistent in learning the score of data distributions, and the proposed training objective serves as an upper bound for the negative likelihood in certain cases.
arXiv Detail & Related papers (2023-07-02T03:49:47Z)
Reflected Diffusion Models [93.26107023470979]
We present Reflected Diffusion Models, which reverse a reflected differential equation evolving on the support of the data. Our approach learns the score function through a generalized score matching loss and extends key components of standard diffusion models.
arXiv Detail & Related papers (2023-04-10T17:54:38Z)
Score Approximation, Estimation and Distribution Recovery of Diffusion Models on Low-Dimensional Data [68.62134204367668]
This paper studies score approximation, estimation, and distribution recovery of diffusion models, when data are supported on an unknown low-dimensional linear subspace. We show that with a properly chosen neural network architecture, the score function can be both accurately approximated and efficiently estimated. The generated distribution based on the estimated score function captures the data geometric structures and converges to a close vicinity of the data distribution.
arXiv Detail & Related papers (2023-02-14T17:02:35Z)
Wavelet-Based Hybrid Machine Learning Model for Out-of-distribution Internet Traffic Prediction [3.689539481706835]
This paper investigates machine learning performances using eXtreme Gradient Boosting, Light Gradient Boosting Machine, Gradient Descent, Gradient Boosting Regressor, Cat Regressor. We propose a hybrid machine learning model integrating wavelet decomposition for improving out-of-distribution prediction.
arXiv Detail & Related papers (2022-05-09T14:34:42Z)
Model-based Policy Optimization with Unsupervised Model Adaptation [37.09948645461043]
We investigate how to bridge the gap between real and simulated data due to inaccurate model estimation for better policy optimization. We propose a novel model-based reinforcement learning framework AMPO, which introduces unsupervised model adaptation. Our approach achieves state-of-the-art performance in terms of sample efficiency on a range of continuous control benchmark tasks.
arXiv Detail & Related papers (2020-10-19T14:19:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.