R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation
- URL: http://arxiv.org/abs/2310.08872v5
- Date: Mon, 27 Nov 2023 08:42:07 GMT
- Title: R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation
- Authors: Jiayu Xiao, Henglei Lv, Liang Li, Shuhui Wang, Qingming Huang
- Abstract summary: We probe into zero-shot grounded T2I generation with diffusion models.
We propose a Region and Boundary (R&B) aware cross-attention guidance approach.
- Score: 74.5598315066249
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent text-to-image (T2I) diffusion models have achieved remarkable progress
in generating high-quality images given text prompts as input. However, these
models fail to convey appropriate spatial composition specified by a layout
instruction. In this work, we probe into zero-shot grounded T2I generation with
diffusion models, that is, generating images corresponding to the input layout
information without training auxiliary modules or finetuning diffusion models.
We propose a Region and Boundary (R&B) aware cross-attention guidance approach
that gradually modulates the attention maps of the diffusion model during the
generative process, helping the model synthesize images that (1) have high
fidelity, (2) are highly compatible with the textual input, and (3) interpret
layout instructions accurately. Specifically, we leverage discrete sampling to
bridge the gap between continuous attention maps and discrete layout
constraints, and design a region-aware loss to refine the generative layout
during the diffusion process. We further propose a boundary-aware loss to
strengthen object discriminability within the corresponding regions.
Experimental results show that our method outperforms existing state-of-the-art
zero-shot grounded T2I generation methods by a large margin both qualitatively
and quantitatively on several benchmarks.
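To make the guidance idea concrete, here is a minimal, self-contained PyTorch sketch of region-aware cross-attention guidance in the spirit of the abstract (not the authors' released code): each grounded token's cross-attention map is compared against a rasterized bounding-box mask, and the gradient of the resulting loss nudges the latent at a sampling step. The helper names (`box_mask`, `region_aware_loss`), tensor shapes, and step size are illustrative assumptions.

```python
# Hedged sketch of region-aware cross-attention guidance (not the R&B implementation).
import torch

def box_mask(box, size=16):
    """Rasterize a normalized (x0, y0, x1, y1) box into a binary (size, size) mask."""
    m = torch.zeros(size, size)
    x0, y0, x1, y1 = (int(round(v * size)) for v in box)
    m[y0:y1, x0:x1] = 1.0
    return m

def region_aware_loss(attn_maps, boxes):
    """attn_maps: {token_idx: (H, W) non-negative cross-attention map}.
    boxes: {token_idx: normalized (x0, y0, x1, y1)}.
    Small when each token's attention mass lies inside its box."""
    loss = 0.0
    for t, attn in attn_maps.items():
        mask = box_mask(boxes[t], attn.shape[-1]).to(attn)
        attn = attn / (attn.sum() + 1e-8)       # treat the map as a distribution
        inside = (attn * mask).sum()            # attention mass inside the box
        loss = loss + (1.0 - inside)            # penalize mass leaking outside
    return loss

# Toy usage: one latent, one grounded token, one guidance step on the latent.
latent = torch.randn(1, 4, 64, 64, requires_grad=True)
# Stand-in for a cross-attention map that depends on the latent; a real sampler
# would obtain this from the UNet's cross-attention layers.
attn = torch.rand(16, 16) * latent.abs().mean()
loss = region_aware_loss({0: attn}, {0: (0.1, 0.1, 0.5, 0.5)})
loss.backward()
latent = (latent - 0.1 * latent.grad).detach()  # nudge the latent toward the layout
```

The boundary-aware loss described in the abstract would contribute an additional term defined on the region boundaries; it is omitted here to keep the sketch short.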
Related papers
- Boundary Attention Constrained Zero-Shot Layout-To-Image Generation [47.435234391588494]
Recent text-to-image diffusion models excel at generating high-resolution images from text but struggle with precise control over spatial composition and object counting.
We propose a novel zero-shot L2I approach, BACON, which eliminates the need for additional modules or fine-tuning.
We leverage pixel-to-pixel correlations in the self-attention feature maps to align cross-attention maps and combine three loss functions constrained by boundary attention to update latent features.
arXiv Detail & Related papers (2024-11-15T05:44:45Z)
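As a rough illustration of the self-attention-based alignment that the BACON summary above describes (a sketch under assumed names and shapes, not the paper's code), row-stochastic pixel-to-pixel self-attention can be used as an affinity matrix to propagate and smooth per-token cross-attention maps:

```python
# Hedged sketch: refining cross-attention maps with self-attention affinities.
import torch

def refine_cross_attention(self_attn, cross_attn, iters=1):
    """self_attn: (N, N) row-stochastic pixel-to-pixel affinities.
    cross_attn: (N, T) per-pixel, per-token cross-attention scores.
    Each iteration re-estimates a pixel's token scores from the pixels it attends to."""
    refined = cross_attn
    for _ in range(iters):
        refined = self_attn @ refined
    return refined

# Toy usage: a 16x16 latent grid (N = 256) and 4 text tokens.
N, T = 256, 4
self_attn = torch.softmax(torch.randn(N, N), dim=-1)
cross_attn = torch.rand(N, T)
aligned = refine_cross_attention(self_attn, cross_attn, iters=2)
```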
- Coherent and Multi-modality Image Inpainting via Latent Space Optimization [61.99406669027195]
PILOT (inPainting via Latent OpTimization) is an optimization approach grounded on a novel semantic centralization and background preservation loss.
Our method searches latent spaces capable of generating inpainted regions that exhibit high fidelity to user-provided prompts while maintaining coherence with the background.
arXiv Detail & Related papers (2024-07-10T19:58:04Z)
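The "background preservation" idea in the PILOT summary can be illustrated with a small hedged sketch (names and shapes are assumptions, not the paper's published objective): outside the inpainting mask, the optimized latent is pulled toward the latent of the original image.

```python
# Hedged sketch of a background-preservation term for latent-space inpainting.
import torch

def background_preservation_loss(latent, original_latent, inpaint_mask):
    """latent, original_latent: (1, C, H, W) latents; inpaint_mask: (1, 1, H, W) with
    1 inside the region to repaint. Penalizes drift from the original outside that region."""
    keep = 1.0 - inpaint_mask
    return ((latent - original_latent) ** 2 * keep).sum() / (keep.sum() + 1e-8)

# Toy usage: repaint the central 32x32 patch of a 64x64 latent grid.
latent = torch.randn(1, 4, 64, 64, requires_grad=True)
original = torch.randn(1, 4, 64, 64)
mask = torch.zeros(1, 1, 64, 64)
mask[..., 16:48, 16:48] = 1.0
loss = background_preservation_loss(latent, original, mask)
loss.backward()  # gradient only flows outside the masked region
```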
- LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? [10.72249123249003]
We revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding.
We introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions.
LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr.
arXiv Detail & Related papers (2024-04-16T17:47:16Z)
- Text-Image Conditioned Diffusion for Consistent Text-to-3D Generation [28.079441901818296]
We propose a text-to-3D method for Neural Radiance Fields (NeRFs) that explicitly enforces fine-grained view consistency.
Our method achieves state-of-the-art performance over existing text-to-3D methods.
arXiv Detail & Related papers (2023-12-19T01:09:49Z)
- Dense Text-to-Image Generation with Attention Modulation [49.287458275920514]
Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions.
We propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions.
We achieve visual quality comparable to that of models specifically trained with layout conditions.
arXiv Detail & Related papers (2023-08-24T17:59:01Z)
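A hedged sketch of the attention-modulation idea behind the DenseDiffusion summary above (illustrative only; the biasing scheme, names, and shapes are assumptions): attention logits between image queries and text keys are shifted up inside each token's target segment and down outside it before the softmax.

```python
# Hedged sketch of layout-conditioned attention modulation.
import torch

def modulate_attention(logits, token_masks, strength=1.0):
    """logits: (N_pixels, T) query-key scores before softmax.
    token_masks: (N_pixels, T), 1 where a token should be expressed, 0 elsewhere.
    Returns softmax attention with a layout-dependent positive/negative bias."""
    bias = strength * (2.0 * token_masks - 1.0)   # +strength inside, -strength outside
    return torch.softmax(logits + bias, dim=-1)

# Toy usage: 64 pixels, 3 tokens, token 0 owns the first half of the pixels.
logits = torch.randn(64, 3)
masks = torch.zeros(64, 3)
masks[:32, 0] = 1.0
masks[32:, 1:] = 1.0
attn = modulate_attention(logits, masks, strength=2.0)
```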
- DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z)
- Grounded Text-to-Image Synthesis with Attention Refocusing [16.9170825951175]
We reveal potential causes of grounding failures in the diffusion model's cross-attention and self-attention layers.
We propose two novel losses to refocus attention maps according to a given spatial layout during sampling.
We show that our proposed attention refocusing effectively improves the controllability of existing approaches.
arXiv Detail & Related papers (2023-06-08T17:59:59Z)
- RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment [112.45442468794658]
We propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff.
In the coarse semantic re-alignment phase, a novel caption reward is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt.
The fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view.
arXiv Detail & Related papers (2023-05-31T06:59:21Z)
- SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis [38.22195812238951]
We propose a novel guidance approach for the sampling process in the diffusion model.
Our approach guides the model with semantic features from CLIP embeddings and enforces geometric constraints.
Our results demonstrate the effectiveness of incorporating bounding box and segmentation map guidance in the diffusion model sampling process.
arXiv Detail & Related papers (2023-04-28T00:14:28Z)