Plug-and-Play Interpretable Responsible Text-to-Image Generation via Dual-Space Multi-facet Concept Control
- URL: http://arxiv.org/abs/2503.18324v1
- Date: Mon, 24 Mar 2025 04:06:39 GMT
- Title: Plug-and-Play Interpretable Responsible Text-to-Image Generation via Dual-Space Multi-facet Concept Control
- Authors: Basim Azam, Naveed Akhtar
- Abstract summary: We propose a unique technique to enable responsible T2I generation in a scalable manner. The key idea is to distill the target T2I pipeline with an external plug-and-play mechanism that learns an interpretable composite responsible space for the desired concepts. At inference, the learned space is utilized to modulate the generative content.
- Score: 28.030708956348864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ethical issues around text-to-image (T2I) models demand comprehensive control over the generative content. Existing techniques addressing these issues for responsible T2I models aim for the generated content to be fair and safe (non-violent/explicit). However, these methods remain bounded to handling the facets of responsibility concepts individually, and they also lack interpretability. Moreover, they often require alteration of the original model, which compromises model performance. In this work, we propose a unique technique that enables responsible T2I generation by simultaneously accounting for an extensive range of fairness and safety concepts in a scalable manner. The key idea is to distill the target T2I pipeline with an external plug-and-play mechanism that learns an interpretable composite responsible space for the desired concepts, conditioned on the target T2I pipeline. We use knowledge distillation and concept whitening to enable this. At inference, the learned space is utilized to modulate the generative content. A typical T2I pipeline presents two plug-in points for our approach, namely the text embedding space and the diffusion model latent space. We develop modules for both points and show the effectiveness of our approach with a range of strong results.
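To make the plug-in idea more concrete, the sketch below shows, in minimal PyTorch, how an external module could map frozen text-encoder embeddings into a small interpretable concept space, steer selected concept axes, and return a modulated embedding to the diffusion model. This is an illustrative sketch, not the authors' implementation: the class name ConceptWhiteningPlugin, the dimensions, and the steering interface are assumptions, and the distillation/concept-whitening training loop is omitted.

```python
# Minimal, illustrative sketch (NOT the authors' implementation) of an external
# plug-and-play module that modulates T2I text embeddings through a learned,
# interpretable concept space. All names and dimensions are assumptions.
import torch
import torch.nn as nn

class ConceptWhiteningPlugin(nn.Module):
    """Maps text embeddings into a concept space, steers selected
    responsibility-concept axes, then maps the correction back."""
    def __init__(self, embed_dim: int = 768, concept_dim: int = 32):
        super().__init__()
        # Learned projections into/out of the concept space (assumed to be
        # trained by distilling the frozen T2I pipeline; training omitted).
        self.to_concepts = nn.Linear(embed_dim, concept_dim, bias=False)
        self.from_concepts = nn.Linear(concept_dim, embed_dim, bias=False)

    def forward(self, text_emb: torch.Tensor,
                concept_idx: list,
                target_value: float = 0.0,
                strength: float = 1.0) -> torch.Tensor:
        # text_emb: (batch, tokens, embed_dim) CLIP-style text embeddings.
        z = self.to_concepts(text_emb)        # concept-space coordinates
        z_mod = z.clone()
        # Steer the chosen concept axes (e.g. a hypothetical "violence" axis)
        # toward a target value; the axes are assumed interpretable.
        z_mod[..., concept_idx] = (1 - strength) * z[..., concept_idx] \
                                  + strength * target_value
        # Apply the change as a residual correction to the original embedding.
        return text_emb + self.from_concepts(z_mod - z)

# Usage sketch: wrap the frozen text encoder's output before the diffusion model.
plugin = ConceptWhiteningPlugin()
fake_emb = torch.randn(1, 77, 768)
safe_emb = plugin(fake_emb, concept_idx=[3], target_value=0.0, strength=0.8)
```

The same residual-style modulation could, in principle, be attached at the diffusion model's latent space, the second plug-in point mentioned in the abstract.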
Related papers
- T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models [88.63040835652902]
Text-to-video models are vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content.
We propose T2VShield, a comprehensive and model-agnostic defense framework designed to protect text-to-video models from jailbreak threats.
Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses.
arXiv Detail & Related papers (2025-04-22T01:18:42Z)
- EraseAnything: Enabling Concept Erasure in Rectified Flow Transformers [33.195628798316754]
EraseAnything is the first method specifically developed to address concept erasure within the latest flow-based T2I framework. We formulate concept erasure as a bi-level optimization problem, employing LoRA-based parameter tuning and an attention map regularizer. We propose a self-contrastive learning strategy to ensure that removing unwanted concepts does not inadvertently harm performance on unrelated ones.
arXiv Detail & Related papers (2024-12-29T09:42:53Z)
- Identity-Preserving Text-to-Video Generation by Frequency Decomposition [52.19475797580653]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity.
This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature.
We propose ConsisID, a tuning-free DiT-based controllable IPT2V model to keep human identity consistent in the generated video.
arXiv Detail & Related papers (2024-11-26T13:58:24Z)
- TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On [78.33688031340698]
TED-VITON is a novel framework that integrates a Garment Semantic (GS) Adapter for enhancing garment-specific features. These innovations enable state-of-the-art (SOTA) performance in visual quality and text fidelity.
arXiv Detail & Related papers (2024-11-26T01:00:09Z)
- MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models [51.1034358143232]
We introduce component-controllable personalization, a new task that allows users to customize and reconfigure individual components within concepts. This task faces two challenges: semantic pollution, where undesirable elements distort the concept, and semantic imbalance, which leads to disproportionate learning of the target concept and component. We design MagicTailor, a framework that uses Dynamic Masked Degradation to adaptively perturb unwanted visual semantics and Dual-Stream Balancing for more balanced learning of desired visual semantics.
arXiv Detail & Related papers (2024-10-17T09:22:53Z)
- SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation [65.30207993362595]
Unlearning/editing-based methods for safe generation remove harmful concepts from models but face several challenges. We propose SAFREE, a training-free approach for safe T2I and T2V. We detect a subspace corresponding to a set of toxic concepts in the text embedding space and steer prompt embeddings away from this subspace.
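To illustrate the kind of subspace steering described in the SAFREE summary above, here is a minimal sketch (an assumption-laden illustration, not SAFREE's code): the hypothetical helper project_away orthogonally removes from prompt embeddings the component lying in a subspace spanned by a few unsafe-concept embeddings.

```python
# Illustrative sketch (not SAFREE's actual code): remove the component of a
# prompt embedding that lies in a "toxic concept" subspace via projection.
import torch

def project_away(prompt_emb: torch.Tensor, toxic_dirs: torch.Tensor) -> torch.Tensor:
    """prompt_emb: (tokens, d); toxic_dirs: (k, d) spanning the toxic subspace."""
    # Orthonormal basis of the toxic subspace via QR decomposition.
    q, _ = torch.linalg.qr(toxic_dirs.T)          # shape (d, k)
    # Subtract the projection onto that subspace from every token embedding.
    return prompt_emb - (prompt_emb @ q) @ q.T

emb = torch.randn(77, 768)
toxic = torch.randn(4, 768)   # e.g. embeddings of unsafe concept phrases
safe_emb = project_away(emb, toxic)
```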
arXiv Detail & Related papers (2024-10-16T17:32:23Z)
- Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation [149.96612254604986]
PRISM is an algorithm that automatically produces human-interpretable and transferable prompts.
Inspired by large language model (LLM) jailbreaking, PRISM leverages the in-context learning ability of LLMs to iteratively refine the candidate prompt distribution.
Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles, and images across multiple T2I models.
arXiv Detail & Related papers (2024-03-28T02:35:53Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", with dedicated components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- Box It to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models [28.278822620442774]
Box-it-to-Bind-it (B2B) is a training-free approach for improving spatial control and semantic accuracy in text-to-image (T2I) diffusion models.
B2B targets three key challenges in T2I: catastrophic neglect, attribute binding, and layout guidance.
B2B is designed as a compatible plug-and-play module for existing T2I models.
arXiv Detail & Related papers (2024-02-27T21:51:32Z)
- $λ$-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space [61.091910046492345]
$λ$-ECLIPSE works in the latent space of a pre-trained CLIP model without relying on the diffusion UNet models.
$λ$-ECLIPSE performs multi-subject-driven P-T2I with just 34M parameters and is trained in a mere 74 GPU hours.
arXiv Detail & Related papers (2024-02-07T19:07:10Z)
- InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models [43.62338454684645]
We study the problem of conditioning T2I diffusion models on Human-Object Interaction (HOI) information.
We propose a pluggable interaction control model, called InteractDiffusion, that extends existing pre-trained T2I diffusion models.
Our model endows existing T2I diffusion models with the ability to control the interaction and its location.
arXiv Detail & Related papers (2023-12-10T10:35:16Z)
- DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation [37.25815760042241]
This paper introduces a new framework, dubbed DirecT2V, for zero-shot text-to-video (T2V) generation.
We equip a diffusion model with a novel value mapping method and dual-softmax filtering, which do not require any additional training.
The experimental results validate the effectiveness of our framework in producing visually coherent and storyful videos.
arXiv Detail & Related papers (2023-05-23T17:57:09Z)
- LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model [55.20469538848806]
LeftRefill is an innovative approach to efficiently harness large Text-to-Image (T2I) diffusion models for reference-guided image synthesis.
arXiv Detail & Related papers (2023-05-19T10:29:42Z)
- T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models [29.280739915676737]
We learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals.
Our T2I-Adapter delivers promising generation quality and supports a wide range of applications.
arXiv Detail & Related papers (2023-02-16T17:56:08Z)