SurrogateSHAP: Training-Free Contributor Attribution for Text-to-Image (T2I) Models
- URL: http://arxiv.org/abs/2601.22276v1
- Date: Thu, 29 Jan 2026 19:48:19 GMT
- Title: SurrogateSHAP: Training-Free Contributor Attribution for Text-to-Image (T2I) Models
- Authors: Mingyu Lu, Soham Gadgil, Chris Lin, Chanwoo Kim, Su-In Lee
- Abstract summary: SurrogateSHAP is a retraining-free framework that approximates the expensive retraining game through inference from a pretrained model. We evaluate SurrogateSHAP across three diverse attribution tasks: (i) image quality for DDPM-CFG on CIFAR-20, (ii) aesthetics for Stable Diffusion on Post-Impressionist artworks, and (iii) product diversity for FLUX.1 on Fashion-Product data.
- Score: 24.06687457570142
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Text-to-Image (T2I) diffusion models are increasingly used in real-world creative workflows, a principled framework for valuing contributors who provide a collection of data is essential for fair compensation and sustainable data marketplaces. While the Shapley value offers a theoretically grounded approach to attribution, it faces a dual computational bottleneck: (i) the prohibitive cost of exhaustive model retraining for each sampled subset of players (i.e., data contributors) and (ii) the combinatorial number of subsets needed to estimate marginal contributions due to contributor interactions. To this end, we propose SurrogateSHAP, a retraining-free framework that approximates the expensive retraining game through inference from a pretrained model. To further improve efficiency, we employ a gradient-boosted tree to approximate the utility function and derive Shapley values analytically from the tree-based model. We evaluate SurrogateSHAP across three diverse attribution tasks: (i) image quality for DDPM-CFG on CIFAR-20, (ii) aesthetics for Stable Diffusion on Post-Impressionist artworks, and (iii) product diversity for FLUX.1 on Fashion-Product data. Across settings, SurrogateSHAP outperforms prior methods while substantially reducing computational overhead, consistently identifying influential contributors across multiple utility metrics. Finally, we demonstrate that SurrogateSHAP effectively localizes data sources responsible for spurious correlations in clinical images, providing a scalable path toward auditing safety-critical generative models.
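The core mechanism is simple to sketch: sample coalitions of contributors, score each with an inference-time utility instead of retraining, fit a gradient-boosted tree to the (coalition, utility) pairs, and read Shapley values off the tree analytically. Below is a minimal, hypothetical illustration, not the authors' implementation: the noisy linear game stands in for the paper's inference-based utilities, and interventional TreeSHAP with the empty coalition as background is one standard way to recover game-theoretic Shapley values from a tree surrogate.

```python
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
n_contributors, n_coalitions = 20, 512

# Sample random coalitions of contributors as binary membership vectors.
S = rng.integers(0, 2, size=(n_coalitions, n_contributors)).astype(float)

# Hypothetical utility: SurrogateSHAP would score each coalition via
# inference from a pretrained model (e.g., an image-quality metric);
# here a noisy linear game stands in so the example is self-contained.
w_true = rng.normal(size=n_contributors)
u = S @ w_true + 0.1 * rng.normal(size=n_coalitions)

# Fit a gradient-boosted tree surrogate of the coalition -> utility map.
surrogate = xgb.XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.1)
surrogate.fit(S, u)

# Interventional TreeSHAP with the empty coalition as background and the
# grand coalition as input yields the Shapley values of the surrogate
# game analytically, without enumerating all 2^n coalitions.
explainer = shap.TreeExplainer(
    surrogate,
    data=np.zeros((1, n_contributors)),  # baseline: empty coalition
    feature_perturbation="interventional",
)
phi = explainer.shap_values(np.ones((1, n_contributors)))[0]
print("Estimated contributor values:", np.round(phi, 3))
```

With a single all-zeros background, interventional TreeSHAP computes Baseline Shapley, which on binary membership inputs coincides with the classic Shapley value of the game defined by the surrogate.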
Related papers
- RS-Prune: Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models [14.093802378976315]
Diffusion-based remote sensing (RS) generative foundation models rely on large amounts of globally representative data. We propose a training-free, two-stage data pruning approach that quickly selects a high-quality subset under high pruning ratios. Experiments show that, even after pruning 85% of the training data, our method significantly improves convergence and generation quality.
arXiv Detail & Related papers (2025-12-29T06:44:06Z)
- FoundIR-v2: Optimizing Pre-Training Data Mixtures for Image Restoration Foundation Model [89.23522479092537]
We propose a high-capacity diffusion-based image restoration foundation model, FoundIR-v2. By leveraging the data mixing law, it ensures a balanced dataset composition and achieves favorable performance against state-of-the-art approaches.
arXiv Detail & Related papers (2025-12-10T03:10:52Z)
- Did Models Sufficient Learn? Attribution-Guided Training via Subset-Selected Counterfactual Augmentation [61.248535801314375]
We propose Subset-Selected Counterfactual Augmentation (SS-CA). We develop Counterfactual LIMA to identify minimal spatial region sets whose removal can selectively alter model predictions. Experiments show that SS-CA improves generalization on in-distribution (ID) test data and achieves superior performance on out-of-distribution (OOD) benchmarks.
arXiv Detail & Related papers (2025-11-15T08:39:22Z)
- DUET: Dual Model Co-Training for Entire Space CTR Prediction [34.35929309131385]
DUET (DUal Model Co-Training for Entire Space CTR Prediction) is a set-wise pre-ranking framework that achieves expressive modeling under tight computational budgets. It consistently outperforms state-of-the-art baselines and achieves improvements across multiple core business metrics.
arXiv Detail & Related papers (2025-10-28T12:46:33Z)
- Nonparametric Data Attribution for Diffusion Models [57.820618036556084]
Data attribution for generative models seeks to quantify the influence of individual training examples on model outputs. We propose a nonparametric attribution method that operates entirely on data, measuring influence via patch-level similarity between generated and training images (see the sketch after this entry).
arXiv Detail & Related papers (2025-10-16T03:37:16Z)
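As a rough reading of that abstract (not the authors' code), influence can be scored by the best cosine similarity between patches of a generated image and patches of each training image; a real implementation would likely use learned patch embeddings rather than the raw pixels in this hypothetical sketch.

```python
import numpy as np

def extract_patches(img, k=8, stride=8):
    """Flatten non-overlapping k x k patches of an H x W x C image."""
    H, W, _ = img.shape
    return np.stack([
        img[i:i + k, j:j + k].ravel()
        for i in range(0, H - k + 1, stride)
        for j in range(0, W - k + 1, stride)
    ])

def patch_influence(generated, train_images, k=8):
    """Score each training image by its best patch-level cosine
    similarity to the generated image (a simple stand-in for the
    paper's nonparametric influence measure)."""
    g = extract_patches(generated, k)
    g /= np.linalg.norm(g, axis=1, keepdims=True) + 1e-8
    scores = []
    for t in train_images:
        p = extract_patches(t, k)
        p /= np.linalg.norm(p, axis=1, keepdims=True) + 1e-8
        scores.append(float((g @ p.T).max()))  # best-matching patch pair
    return np.array(scores)

# Toy usage: rank 10 random "training images" against one "generated" image.
rng = np.random.default_rng(0)
train = [rng.random((32, 32, 3)) for _ in range(10)]
gen = rng.random((32, 32, 3))
print(np.argsort(-patch_influence(gen, train)))  # most influential first
```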
- A Tale of Two Experts: Cooperative Learning for Source-Free Unsupervised Domain Adaptation [59.88864205383671]
Source-Free Unsupervised Domain Adaptation (SFUDA) addresses the realistic challenge of adapting a source-trained model to a target domain without access to the source data. Existing SFUDA methods either exploit only the source model's predictions or fine-tune large multimodal models. We propose Experts Cooperative Learning (EXCL) to exploit complementary insights and the latent structure of target data.
arXiv Detail & Related papers (2025-09-26T11:39:50Z)
- MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources [113.33902847941941]
Variance-Aware Sampling (VAS) is a data selection strategy guided by the Variance Promotion Score (VPS). We release large-scale, carefully curated resources containing 1.6M long CoT cold-start data and 15k RL QA pairs. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS.
arXiv Detail & Related papers (2025-09-25T14:58:29Z)
- $\alpha$-TCVAE: On the relationship between Disentanglement and Diversity [21.811889512977924]
In this work, we introduce $\alpha$-TCVAE, a variational autoencoder optimized using a novel total correlation (TC) lower bound.
We present quantitative analyses that support the idea that disentangled representations lead to better generative capabilities and diversity.
Our results demonstrate that $\alpha$-TCVAE consistently learns more disentangled representations than baselines and generates more diverse observations.
arXiv Detail & Related papers (2024-11-01T13:50:06Z)
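For background on where the total correlation term sits, the standard decomposition of the aggregate posterior KL used by TC-based VAEs (e.g., beta-TCVAE) is reproduced below; the paper's specific $\alpha$-weighted lower bound is not restated here.

```latex
% Decomposition of the average posterior-prior KL into index-code mutual
% information, total correlation (TC), and dimension-wise KL terms:
\mathbb{E}_{p(x)}\!\left[\mathrm{KL}\!\left(q(z \mid x)\,\middle\|\,p(z)\right)\right]
  = \underbrace{I_q(x; z)}_{\text{index-code MI}}
  + \underbrace{\mathrm{KL}\!\Big(q(z)\,\Big\|\,\prod\nolimits_j q(z_j)\Big)}_{\text{total correlation}}
  + \underbrace{\sum\nolimits_j \mathrm{KL}\!\left(q(z_j)\,\middle\|\,p(z_j)\right)}_{\text{dimension-wise KL}}
```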
- An Efficient Framework for Crediting Data Contributors of Diffusion Models [13.761241561734547]
We introduce a method to efficiently retrain and rerun inference for Shapley value estimation. We evaluate the utility of our method with three use cases: (i) image quality for a DDPM trained on a CIFAR dataset, (ii) demographic diversity for an LDM trained on CelebA-HQ, and (iii) aesthetic quality for a Stable Diffusion model LoRA-finetuned on Post-Impressionist artworks.
arXiv Detail & Related papers (2024-06-09T17:42:09Z)
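Both that framework and SurrogateSHAP estimate the same target: Shapley values of contributors in a utility game over data subsets. The generic permutation-sampling baseline that such methods accelerate can be sketched as follows, with a hypothetical `utility` callback standing in for model retraining or inference.

```python
import numpy as np

def shapley_permutation(utility, n_players, n_perms=200, seed=0):
    """Monte Carlo Shapley estimate for a set game.

    `utility` maps a boolean membership mask to a scalar. In the
    retraining game this call would retrain the generative model on
    the selected contributors' data, which is what makes exact
    Shapley computation prohibitively expensive.
    """
    rng = np.random.default_rng(seed)
    phi = np.zeros(n_players)
    for _ in range(n_perms):
        order = rng.permutation(n_players)
        mask = np.zeros(n_players, dtype=bool)
        prev = utility(mask)
        for i in order:           # add players one at a time
            mask[i] = True
            cur = utility(mask)
            phi[i] += cur - prev  # marginal contribution of player i
            prev = cur
    return phi / n_perms

# Toy usage with a cheap additive game (real games are far costlier).
w = np.arange(5, dtype=float)
print(np.round(shapley_permutation(lambda m: float(w[m].sum()), 5), 2))
```

For this additive toy game the estimator recovers the weights exactly; in the retraining game, each `utility` call is a full training run, which is precisely the cost SurrogateSHAP's surrogate avoids.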
- Benchmarking Zero-Shot Robustness of Multimodal Foundation Models: A Pilot Study [61.65123150513683]
Multimodal foundation models, such as CLIP, produce state-of-the-art zero-shot results.
It is reported that these models close the robustness gap by matching the performance of supervised models trained on ImageNet.
We show that CLIP leads to a significant robustness drop compared to supervised ImageNet models on our benchmark.
arXiv Detail & Related papers (2024-03-15T17:33:49Z)
- Intriguing Properties of Data Attribution on Diffusion Models [33.77847454043439]
Data attribution seeks to trace desired outputs back to training data.
Data attribution has become a desired module for properly assigning value to high-quality or copyrighted training data.
arXiv Detail & Related papers (2023-11-01T13:00:46Z)
- Uncovering the Hidden Cost of Model Compression [43.62624133952414]
Visual Prompting has emerged as a pivotal method for transfer learning in computer vision.
Model compression detrimentally impacts the performance of visual prompting-based transfer.
However, negative effects on calibration are not present when models are compressed via quantization.
arXiv Detail & Related papers (2023-08-29T01:47:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.