Not All Prompts Are Made Equal: Prompt-based Pruning of Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2406.12042v3
- Date: Tue, 11 Feb 2025 15:58:10 GMT
- Title: Not All Prompts Are Made Equal: Prompt-based Pruning of Text-to-Image Diffusion Models
- Authors: Alireza Ganjdanesh, Reza Shirkavand, Shangqian Gao, Heng Huang
- Abstract summary: We introduce Adaptive Prompt-Tailored Pruning (APTP), a novel prompt-based pruning method for text-to-image (T2I) models.
APTP learns to determine the required capacity for an input text prompt and routes it to an architecture code, given a total desired compute budget for prompts.
APTP outperforms the single-model pruning baselines in terms of FID, CLIP, and CMMD scores.
- Score: 59.16287352266203
- Abstract: Text-to-image (T2I) diffusion models have demonstrated impressive image generation capabilities. Still, their computational intensity prohibits resource-constrained organizations from deploying T2I models after fine-tuning them on their internal target data. While pruning techniques offer a potential solution to reduce the computational burden of T2I models, static pruning methods use the same pruned model for all input prompts, overlooking the varying capacity requirements of different prompts. Dynamic pruning addresses this issue by utilizing a separate sub-network for each prompt, but it prevents batch parallelism on GPUs. To overcome these limitations, we introduce Adaptive Prompt-Tailored Pruning (APTP), a novel prompt-based pruning method designed for T2I diffusion models. Central to our approach is a prompt router model, which learns to determine the required capacity for an input text prompt and routes it to an architecture code, given a total desired compute budget for prompts. Each architecture code represents a specialized model tailored to the prompts assigned to it, and the number of codes is a hyperparameter. We train the prompt router and architecture codes using contrastive learning, ensuring that similar prompts are mapped to nearby codes. Further, we employ optimal transport to prevent the codes from collapsing into a single one. We demonstrate APTP's effectiveness by pruning Stable Diffusion (SD) V2.1 using CC3M and COCO as target datasets. APTP outperforms the single-model pruning baselines in terms of FID, CLIP, and CMMD scores. Our analysis of the clusters learned by APTP reveals they are semantically meaningful. We also show that APTP can automatically discover previously empirically found challenging prompts for SD, e.g. prompts for generating text images, assigning them to higher capacity codes.
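A minimal sketch of the routing idea is given below. It is a hypothetical illustration, not the authors' implementation: it assumes prompt embeddings come from a frozen text encoder (stubbed with random tensors here), represents the architecture codes as learnable vectors, and uses a SwAV-style Sinkhorn normalization as a simplified stand-in for the contrastive and optimal-transport objectives described in the abstract; names such as `PromptRouter` and `sinkhorn` are invented for this example.

```python
# Hypothetical sketch of prompt routing in the spirit of APTP (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptRouter(nn.Module):
    """Maps a prompt embedding to similarities over K learnable architecture codes."""
    def __init__(self, text_dim=768, code_dim=128, num_codes=8):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, code_dim), nn.ReLU(), nn.Linear(code_dim, code_dim)
        )
        self.codes = nn.Parameter(torch.randn(num_codes, code_dim))  # architecture codes

    def forward(self, prompt_emb):
        z = F.normalize(self.proj(prompt_emb), dim=-1)   # routed prompt representation
        c = F.normalize(self.codes, dim=-1)
        return z @ c.t()                                  # (batch, num_codes) similarities

def sinkhorn(logits, n_iters=3, eps=0.05):
    """Balanced soft assignment of prompts to codes; keeps codes from collapsing."""
    q = torch.exp(logits / eps)
    q = q / q.sum()
    for _ in range(n_iters):
        q = q / q.sum(dim=0, keepdim=True)  # equalize mass per code
        q = q / q.sum(dim=1, keepdim=True)  # one unit of mass per prompt
    return q

# Toy training step; a frozen text encoder is stubbed with random embeddings.
router = PromptRouter()
opt = torch.optim.Adam(router.parameters(), lr=1e-3)
prompt_emb = torch.randn(32, 768)                 # stand-in for text-encoder output
logits = router(prompt_emb)
with torch.no_grad():
    targets = sinkhorn(logits)                    # OT-balanced soft targets
loss = F.cross_entropy(logits / 0.1, targets)     # simplified clustering objective
loss.backward()
opt.step()

# At inference, a prompt is routed to its most similar code:
# code_id = router(prompt_emb).argmax(dim=-1)
```

Because every prompt assigned to the same code uses the same specialized pruned sub-network, prompts sharing a code can still be batched on a GPU, which is the property the abstract contrasts with per-prompt dynamic pruning.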
Related papers
- One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt [101.17660804110409]
Text-to-image generation models can create high-quality images from input prompts.
However, they struggle to support consistent, identity-preserving generation across images for storytelling.
We propose a novel training-free method for consistent text-to-image generation.
arXiv Detail & Related papers (2025-01-23T10:57:22Z)
- Differentiable Prompt Learning for Vision Language Models [49.132774679968456]
We propose a differentiable prompt learning (DPL) method.
DPL is formulated as an optimization problem to automatically determine the optimal context length of the prompt to be added to each layer.
We empirically find that, using only limited data, our DPL method can find a deep continuous prompt configuration with high confidence.
arXiv Detail & Related papers (2024-12-31T14:13:28Z)
- ChangeDiff: A Multi-Temporal Change Detection Data Generator with Flexible Text Prompts via Diffusion Model [21.50463332137926]
This paper focuses on the semantic change detection (SCD) task and develops a multi-temporal SCD data generator, ChangeDiff.
ChangeDiff generates change data in two steps: first, it uses text prompts and a text-to-image model to create continuous layouts, and then it employs a layout-to-image model to convert these layouts into images.
Our generated data shows significant progress in temporal continuity, spatial diversity, and realism, improving the accuracy and transferability of change detectors.
arXiv Detail & Related papers (2024-12-20T03:58:28Z)
- Implicit and Explicit Language Guidance for Diffusion-based Visual Perception [42.71751651417168]
Text-to-image diffusion models can generate high-quality images with rich texture and reasonable structure under different text prompts.
We propose an implicit and explicit language guidance framework for diffusion-based perception, named IEDP.
Our IEDP achieves promising performance on two typical perception tasks: semantic segmentation and depth estimation.
arXiv Detail & Related papers (2024-04-11T09:39:58Z)
- Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation [149.96612254604986]
PRISM is an algorithm that automatically identifies human-interpretable and transferable prompts.
It can effectively generate desired concepts given only black-box access to T2I models.
Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles and images.
arXiv Detail & Related papers (2024-03-28T02:35:53Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection [53.320946030761796]
Diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt.
We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts.
We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system.
arXiv Detail & Related papers (2023-05-22T17:59:41Z)
- A Unified Framework for Multi-intent Spoken Language Understanding with Prompting [14.17726194025463]
We describe a Prompt-based Spoken Language Understanding (PromptSLU) framework that intuitively unifies the two sub-tasks, intent detection (ID) and slot filling (SF), into the same form.
In detail, ID and SF are completed by concisely filling the utterance into task-specific prompt templates as input, and by sharing an output format of key-value pair sequences.
Experiment results show that our framework outperforms several state-of-the-art baselines on two public datasets.
arXiv Detail & Related papers (2022-10-07T05:58:05Z)