Related papers: Not All Prompts Are Made Equal: Prompt-based Pruning of Text-to-Image Diffusion Models

URL: http://arxiv.org/abs/2406.12042v1
Date: Mon, 17 Jun 2024 19:22:04 GMT
Title: Not All Prompts Are Made Equal: Prompt-based Pruning of Text-to-Image Diffusion Models
Authors: Alireza Ganjdanesh, Reza Shirkavand, Shangqian Gao, Heng Huang
Abstract summary: We introduce Adaptive Prompt-Tailored Pruning (APTP), a novel prompt-based pruning method for text-to-image (T2I) diffusion models. APTP learns to determine the required capacity for an input text prompt and routes it to an architecture code, given a total desired compute budget for prompts. APTP outperforms the single-model pruning baselines in terms of FID, CLIP, and CMMD scores.
Score: 59.16287352266203
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-to-image (T2I) diffusion models have demonstrated impressive image generation capabilities. Still, their computational intensity prohibits resource-constrained organizations from deploying T2I models after fine-tuning them on their internal target data. While pruning techniques offer a potential solution to reduce the computational burden of T2I models, static pruning methods use the same pruned model for all input prompts, overlooking the varying capacity requirements of different prompts. Dynamic pruning addresses this issue by utilizing a separate sub-network for each prompt, but it prevents batch parallelism on GPUs. To overcome these limitations, we introduce Adaptive Prompt-Tailored Pruning (APTP), a novel prompt-based pruning method designed for T2I diffusion models. Central to our approach is a prompt router model, which learns to determine the required capacity for an input text prompt and routes it to an architecture code, given a total desired compute budget for prompts. Each architecture code represents a specialized model tailored to the prompts assigned to it, and the number of codes is a hyperparameter. We train the prompt router and architecture codes using contrastive learning, ensuring that similar prompts are mapped to nearby codes. Further, we employ optimal transport to prevent the codes from collapsing into a single one. We demonstrate APTP's effectiveness by pruning Stable Diffusion (SD) V2.1 using CC3M and COCO as target datasets. APTP outperforms the single-model pruning baselines in terms of FID, CLIP, and CMMD scores. Our analysis of the clusters learned by APTP reveals they are semantically meaningful. We also show that APTP can automatically discover previously empirically found challenging prompts for SD, e.g., prompts for generating text images, assigning them to higher capacity codes.
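
Illustrative sketch (not from the paper): the abstract says a router maps each prompt to one of a fixed number of architecture codes and that optimal transport keeps the codes from collapsing into a single one. The snippet below shows only that balancing idea, under loud assumptions: random vectors stand in for prompt and code embeddings, cosine similarity stands in for the learned router score, and Sinkhorn normalization is used as one common entropic optimal-transport solver. The function names `similarity` and `sinkhorn_assign` and the parameters `eps` and `n_iters` are hypothetical and do not come from the authors' implementation.

```python
import numpy as np


def similarity(prompt_emb, code_emb):
    """Cosine similarity between prompts (rows) and codes (columns)."""
    p = prompt_emb / np.linalg.norm(prompt_emb, axis=1, keepdims=True)
    c = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)
    return p @ c.T


def sinkhorn_assign(scores, eps=0.5, n_iters=50):
    """Soft prompt-to-code assignment, pushed toward equal mass per code.

    Assumption: Sinkhorn iterations with temperature `eps`; the abstract
    does not say which optimal-transport solver the authors use.
    """
    n_prompts, n_codes = scores.shape
    # Subtract the max before exponentiating for numerical stability.
    q = np.exp((scores - scores.max()) / eps)
    for _ in range(n_iters):
        # Column step: every code receives total mass 1 / n_codes.
        q /= q.sum(axis=0, keepdims=True) * n_codes
        # Row step: every prompt carries total mass 1 / n_prompts.
        q /= q.sum(axis=1, keepdims=True) * n_prompts
    # Rescale so each row is a probability distribution over codes.
    return q * n_prompts


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prompts = rng.normal(size=(8, 16))  # stand-in prompt embeddings
    codes = rng.normal(size=(4, 16))    # stand-in architecture-code embeddings
    scores = similarity(prompts, codes)
    q = sinkhorn_assign(scores)
    print("greedy routing:  ", scores.argmax(axis=1))
    print("balanced routing:", q.argmax(axis=1))
    print("mass per code:   ", q.sum(axis=0).round(2))
```

Comparing the greedy and balanced routings on the same scores shows the intent of the constraint: greedy argmax may send most prompts to one code, while the balanced assignment spreads mass across codes. In the paper the router and codes are trained with contrastive learning, which this sketch does not attempt.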

Related papers

Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection [17.590853105242864]
Vision-language models (e.g. CLIP) have demonstrated remarkable performance in zero-shot anomaly detection (ZSAD). Bayes-PFL is designed to learn both image-specific and image-agnostic distributions, which are jointly utilized to regularize the text prompt space and improve the model's generalization on unseen categories. Experiments on 15 industrial and medical datasets demonstrate our method's superior performance.
arXiv Detail & Related papers (2025-03-13T06:05:35Z)
One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt [101.17660804110409]
Text-to-image generation models can create high-quality images from input prompts. However, they struggle to support the consistent generation of identity-preserving requirements for storytelling. We propose a novel training-free method for consistent text-to-image generation.
arXiv Detail & Related papers (2025-01-23T10:57:22Z)
Differentiable Prompt Learning for Vision Language Models [49.132774679968456]
We propose a method dubbed differentiable prompt learning (DPL). DPL is formulated as an optimization problem to automatically determine the optimal context length of the prompt to be added to each layer. We empirically find that by using only limited data, our DPL method can find deep continuous prompt configuration with high confidence.
arXiv Detail & Related papers (2024-12-31T14:13:28Z)
ChangeDiff: A Multi-Temporal Change Detection Data Generator with Flexible Text Prompts via Diffusion Model [21.50463332137926]
This paper focuses on the semantic change detection (SCD) task and develops a multi-temporal SCD data generator, ChangeDiff. ChangeDiff generates change data in two steps: first, it uses text prompts and a text-to-image model to create continuous layouts, and then it employs layout-to-image to convert these layouts into images. Our generated data shows significant progress in temporal continuity, spatial diversity, and quality realism, empowering change detectors with accuracy and transferability.
arXiv Detail & Related papers (2024-12-20T03:58:28Z)
(PASS) Visual Prompt Locates Good Structure Sparsity through a Recurrent HyperNetwork [60.889175951038496]
Large-scale neural networks have demonstrated remarkable performance in different domains like vision and language processing. One of the key questions of structural pruning is how to estimate the channel significance. We propose a novel algorithmic framework, namely PASS. It is a tailored hyper-network to take both visual prompts and network weight statistics as input, and output layer-wise channel sparsity in a recurrent manner.
arXiv Detail & Related papers (2024-07-24T16:47:45Z)
Implicit and Explicit Language Guidance for Diffusion-based Visual Perception [42.71751651417168]
Text-to-image diffusion models can generate high-quality images with rich texture and reasonable structure under different text prompts. We propose an implicit and explicit language guidance framework for diffusion-based perception, named IEDP. Our IEDP achieves promising performance on two typical perception tasks, including semantic segmentation and depth estimation.
arXiv Detail & Related papers (2024-04-11T09:39:58Z)
Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation [150.57983348059528]
PRISM is an algorithm that automatically identifies human-interpretable and transferable prompts. It can effectively generate desired concepts given only black-box access to T2I models. Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles and images.
arXiv Detail & Related papers (2024-03-28T02:35:53Z)
Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery. Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data. In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models. Our method leverages a pretrained large language model for grounded generation in a novel two-stage process. Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection [53.320946030761796]
Diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt. We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts. We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system.
arXiv Detail & Related papers (2023-05-22T17:59:41Z)
A Unified Framework for Multi-intent Spoken Language Understanding with prompting [14.17726194025463]
We describe a Prompt-based Spoken Language Understanding (PromptSLU) framework, to intuitively unify two sub-tasks into the same form. In detail, intent detection (ID) and slot filling (SF) are completed by concisely filling the utterance into task-specific prompt templates as input, and sharing output formats of key-value pairs sequence. Experiment results show that our framework outperforms several state-of-the-art baselines on two public datasets.
arXiv Detail & Related papers (2022-10-07T05:58:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.