Instilling Multi-round Thinking to Text-guided Image Generation
- URL: http://arxiv.org/abs/2401.08472v2
- Date: Sat, 9 Mar 2024 15:52:05 GMT
- Title: Instilling Multi-round Thinking to Text-guided Image Generation
- Authors: Lidong Zeng, Zhedong Zheng, Yinwei Wei, Tat-seng Chua
- Abstract summary: Single-round generation often overlooks crucial details, particularly in the realm of fine-grained changes like shoes or sleeves.
We introduce a new self-supervised regularization, i.e., multi-round regularization, which is compatible with existing methods.
It builds upon the observation that the modification order generally should not affect the final result.
- Score: 72.2032630115201
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper delves into the text-guided image editing task, focusing on
modifying a reference image according to user-specified textual feedback to
embody specific attributes. Despite recent advancements, a persistent challenge
remains that the single-round generation often overlooks crucial details,
particularly in the realm of fine-grained changes like shoes or sleeves. This
issue compounds over multiple rounds of interaction, severely limiting
customization quality. In an attempt to address this challenge, we introduce a
new self-supervised regularization, i.e., multi-round regularization, which is
compatible with existing methods. Specifically, the multi-round regularization
encourages the model to maintain consistency across different modification
orders. It builds upon the observation that the modification order generally
should not affect the final result. Different from traditional one-round
generation, the mechanism underpinning the proposed method is the error
amplification of initially minor inaccuracies in capturing intricate details.
Qualitative and quantitative experiments affirm that the proposed method
achieves high-fidelity editing quality, especially the local modification, in
both single-round and multiple-round generation, while also showcasing robust
generalization to irregular text inputs. The effectiveness of our semantic
alignment with textual feedback is further substantiated by the retrieval
improvements on FashionIQ and Fashion200k.
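The order-consistency idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the vector "images", the two editors, and the function names `apply_edits` and `order_consistency_loss` are all hypothetical stand-ins, chosen only to show how a penalty on disagreement between modification orders might be computed.

```python
from itertools import permutations
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical editor signature: (current image, edit instruction) -> edited image.
Editor = Callable[[Vector, Vector], Vector]


def apply_edits(editor: Editor, image: Vector, edits: Sequence[Vector]) -> Vector:
    """Apply the edits one round at a time, feeding each output into the next round."""
    current = list(image)
    for edit in edits:
        current = editor(current, edit)
    return current


def order_consistency_loss(editor: Editor, image: Vector, edits: Sequence[Vector]) -> float:
    """Average mean-squared difference between the first edit ordering and every other ordering."""
    results = [apply_edits(editor, image, order) for order in permutations(edits)]
    reference = results[0]
    total = 0.0
    for result in results[1:]:
        total += sum((r - q) ** 2 for r, q in zip(result, reference)) / len(reference)
    return total / max(len(results) - 1, 1)


def additive_editor(image: Vector, edit: Vector) -> Vector:
    """Toy editor whose result does not depend on the order of edits."""
    return [x + e for x, e in zip(image, edit)]


def lossy_editor(image: Vector, edit: Vector) -> Vector:
    """Toy editor that damps the existing image each round, so order matters."""
    return [0.5 * x + e for x, e in zip(image, edit)]


image = [1.0, 2.0]
edits = [[1.0, 0.0], [0.0, 1.0]]  # e.g. "change the sleeves", "change the shoes"
print(order_consistency_loss(additive_editor, image, edits))  # 0.0
print(order_consistency_loss(lossy_editor, image, edits))     # 0.25
```

In this sketch the order-independent editor incurs no penalty, while the editor that degrades earlier content each round is penalized. In a real system the editor would be a generative model and the penalty would be one term in its training loss.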
Related papers
- Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory [33.78620829249978]
Text-to-image (T2I) diffusion models have revolutionized generative modeling by producing high-fidelity, diverse, and visually realistic images.
Recent attention-based methods have improved object inclusion and linguistic binding, but still face challenges such as attribute misbinding.
We propose a Bayesian approach that designs custom priors over attention distributions to enforce desirable properties.
Our approach treats the attention mechanism as an interpretable component, enabling fine-grained control and improved attribute-object alignment.
arXiv Detail & Related papers (2024-11-25T10:57:48Z)
- LIPE: Learning Personalized Identity Prior for Non-rigid Image Editing [20.861672583434718]
We present LIPE, a two-stage framework designed to customize the generative model utilizing a limited set of images of the same subject, and subsequently employ the model with learned prior for non-rigid image editing.
arXiv Detail & Related papers (2024-06-25T02:56:16Z)
- MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance [6.4680449907623006]
This research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multi-subjects.
The proposed multi-subject cross-attention orchestrates inter-subject compositions while preserving the control of texts.
arXiv Detail & Related papers (2024-06-11T12:32:53Z)
- Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation [60.943159830780154]
We introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process.
We demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
arXiv Detail & Related papers (2024-03-25T17:52:07Z)
- Ground-A-Score: Scaling Up the Score Distillation for Multi-Attribute Editing [49.419619882284906]
Ground-A-Score is a model-agnostic image editing method that incorporates grounding during score distillation.
The selective application with a new penalty coefficient and contrastive loss helps to precisely target editing areas.
Both qualitative assessments and quantitative analyses confirm that Ground-A-Score successfully adheres to the intricate details of extended and multifaceted prompts.
arXiv Detail & Related papers (2024-03-20T12:40:32Z)
- Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Text-to-image models can portray the same subject across diverse prompts.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z)
- Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization [56.12990759116612]
Pick-and-Draw is a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods.
The proposed approach can be applied to any personalized diffusion models and requires as few as a single reference image.
arXiv Detail & Related papers (2024-01-30T05:56:12Z)
- LIME: Localized Image Editing via Attention Regularization in Diffusion Models [74.3811832586391]
This paper introduces LIME for localized image editing in diffusion models that do not require user-specified regions of interest (RoI) or additional text input.
Our method employs features from pre-trained methods and a simple clustering technique to obtain precise semantic segmentation maps.
We propose a novel cross-attention regularization technique that penalizes unrelated cross-attention scores in the RoI during the denoising steps, ensuring localized edits.
arXiv Detail & Related papers (2023-12-14T18:59:59Z)
- AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing [24.9487669818162]
We propose a spatio-temporal guided adaptive editing algorithm, AdapEdit, which realizes adaptive image editing.
Our approach has a significant advantage in preserving model priors and does not require model training, fine-tuning, extra data, or optimization.
We present our results over a wide variety of raw images and editing instructions, demonstrating competitive performance and showing it significantly outperforms the previous approaches.
arXiv Detail & Related papers (2023-12-13T09:45:58Z)
- Variational Bayesian Framework for Advanced Image Generation with Domain-Related Variables [29.827191184889898]
We present a unified Bayesian framework for advanced conditional generative problems.
We propose a variational Bayesian image translation network (VBITN) that enables multiple image translation and editing tasks.
arXiv Detail & Related papers (2023-05-23T09:47:23Z)