Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance
- URL: http://arxiv.org/abs/2405.01356v1
- Date: Thu, 2 May 2024 15:03:41 GMT
- Title: Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance
- Authors: Kelvin C. K. Chan, Yang Zhao, Xuhui Jia, Ming-Hsuan Yang, Huisheng Wang
- Abstract summary: We show that through constructing a subject-agnostic condition, one could obtain outputs consistent with both the given subject and input text prompts.
Our approach is conceptually simple and requires only minimal code modifications, but leads to substantial quality improvements.
- Score: 62.15866177242207
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In subject-driven text-to-image synthesis, the synthesis process tends to be heavily influenced by the reference images provided by users, often overlooking crucial attributes detailed in the text prompt. In this work, we propose Subject-Agnostic Guidance (SAG), a simple yet effective solution to remedy the problem. We show that through constructing a subject-agnostic condition and applying our proposed dual classifier-free guidance, one could obtain outputs consistent with both the given subject and input text prompts. We validate the efficacy of our approach through both optimization-based and encoder-based methods. Additionally, we demonstrate its applicability in second-order customization methods, where an encoder-based model is fine-tuned with DreamBooth. Our approach is conceptually simple and requires only minimal code modifications, but leads to substantial quality improvements, as evidenced by our evaluations and user studies.
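The abstract's core mechanism is a dual classifier-free guidance that blends three noise predictions: unconditional, subject-agnostic (text-only), and subject-aware. The paper's exact weighting scheme is not given here, so the following is a minimal NumPy sketch of one plausible formulation; the function name, guidance weights, and composition order are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dual_cfg(eps_uncond, eps_agnostic, eps_subject, w_text=7.5, w_subj=3.0):
    """Combine three noise predictions with a dual classifier-free guidance.

    Hypothetical formulation: guide first toward the subject-agnostic
    (text-only) condition, then toward the full subject-aware condition.
    The paper's actual weighting may differ.
    """
    return (eps_uncond
            + w_text * (eps_agnostic - eps_uncond)   # text guidance
            + w_subj * (eps_subject - eps_agnostic)) # subject guidance

# Toy example with dummy noise predictions standing in for a denoiser's output.
rng = np.random.default_rng(0)
shape = (4, 64, 64)
e_u, e_a, e_s = (rng.standard_normal(shape) for _ in range(3))
out = dual_cfg(e_u, e_a, e_s)
print(out.shape)  # (4, 64, 64)
```

Note that with `w_subj=0` this reduces to standard classifier-free guidance on the text-only condition, which matches the abstract's claim that the change over a stock diffusion sampler is minimal.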
Related papers
- Hard Prompts Made Interpretable: Sparse Entropy Regularization for Prompt Tuning with RL [29.01858866450715]
We present RLPrompt, which aims to find optimal prompt tokens by leveraging soft Q-learning.
While the results show promise, we have observed that the prompts frequently appear unnatural, which impedes their interpretability.
We address this limitation by using sparse Tsallis entropy regularization, a principled approach to filtering out unlikely tokens from consideration.
arXiv Detail & Related papers (2024-07-20T03:10:19Z) - Tuning-Free Image Customization with Image and Text Guidance [65.9504243633169]
We introduce a tuning-free framework for simultaneous text-image-guided image customization.
Our approach preserves the semantic features of the reference image subject while allowing modification of detailed attributes based on text descriptions.
Our approach outperforms previous methods in both human and quantitative evaluations.
arXiv Detail & Related papers (2024-03-19T11:48:35Z) - Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis [60.260724486834164]
This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries.
We present two key innovations: Vision Guidance and the Layered Rendering Diffusion framework.
We apply our method to three practical applications: bounding box-to-image, semantic mask-to-image and image editing.
arXiv Detail & Related papers (2023-11-30T10:36:19Z) - Cones 2: Customizable Image Synthesis with Multiple Subjects [50.54010141032032]
We study how to efficiently represent a particular subject as well as how to appropriately compose different subjects.
By rectifying the activations in the cross-attention map, the layout appoints and separates the location of different subjects in the image.
arXiv Detail & Related papers (2023-05-30T18:00:06Z) - ConES: Concept Embedding Search for Parameter Efficient Tuning Large Vision Language Models [21.15548013842187]
We propose a Concept Embedding Search (ConES) approach by optimizing prompt embeddings.
By dropping the text encoder, we are able to significantly speed up the learning process.
Our approach outperforms prompt tuning and textual inversion methods across a variety of downstream tasks.
arXiv Detail & Related papers (2023-05-30T12:45:49Z) - Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z) - High-Fidelity Guided Image Synthesis with Latent Diffusion Models [50.39294302741698]
Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores.
arXiv Detail & Related papers (2022-11-30T15:43:20Z) - Cross Modification Attention Based Deliberation Model for Image Captioning [11.897899189552318]
We propose a universal two-pass decoding framework for image captioning.
A single-pass decoding based model first generates a draft caption according to an input image.
A Deliberation Model then performs the polishing process to refine the draft caption to a better image description.
arXiv Detail & Related papers (2021-09-17T08:38:08Z) - PerceptionGAN: Real-world Image Construction from Provided Text through Perceptual Understanding [11.985768957782641]
We propose a method for generating higher-quality images by incorporating perceptual understanding into the discriminator module.
We show that the perceptual information captured in the initial image is improved while modeling the image distribution at multiple stages.
More importantly, the proposed method can be integrated into the pipelines of other state-of-the-art text-to-image generation models.
arXiv Detail & Related papers (2020-07-02T09:23:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.