RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine
Semantic Re-alignment
- URL: http://arxiv.org/abs/2305.19599v3
- Date: Mon, 27 Nov 2023 12:50:09 GMT
- Title: RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine
Semantic Re-alignment
- Authors: Guian Fang, Zutao Jiang, Jianhua Han, Guansong Lu, Hang Xu, Shengcai
Liao, Xiaodan Liang
- Abstract summary: We propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff.
In the coarse semantic re-alignment phase, a novel caption reward is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt.
The fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view.
- Score: 91.13260535010842
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in text-to-image diffusion models have achieved remarkable
success in generating high-quality, realistic images from textual descriptions.
However, these approaches have faced challenges in precisely aligning the
generated visual content with the textual concepts described in the prompts. In
this paper, we propose a two-stage coarse-to-fine semantic re-alignment method,
named RealignDiff, aimed at improving the alignment between text and images in
text-to-image diffusion models. In the coarse semantic re-alignment phase, a
novel caption reward, leveraging the BLIP-2 model, is proposed to evaluate the
semantic discrepancy between the generated image caption and the given text
prompt. Subsequently, the fine semantic re-alignment stage employs a local
dense caption generation module and a re-weighting attention modulation module
to refine the previously generated images from a local semantic view.
Experimental results on the MS-COCO benchmark demonstrate that the proposed
two-stage coarse-to-fine semantic re-alignment method outperforms other
baseline re-alignment techniques by a substantial margin in both visual quality
and semantic similarity with the input prompt.
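The abstract names BLIP-2 as the captioner behind the coarse-stage reward but leaves the comparison metric unspecified. The sketch below is a minimal reconstruction under that reading: caption the generated image with BLIP-2, then score caption-prompt similarity with a sentence-embedding model; the checkpoint names and the cosine-similarity choice are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a BLIP-2 caption reward (coarse re-alignment stage).
# The similarity model and checkpoints are assumptions; the paper only
# specifies that BLIP-2 captions the generated image.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from sentence_transformers import SentenceTransformer, util

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
captioner = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")
scorer = SentenceTransformer("all-MiniLM-L6-v2")

def caption_reward(image: Image.Image, prompt: str) -> float:
    # Caption the generated image with BLIP-2.
    inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
    ids = captioner.generate(**inputs, max_new_tokens=30)
    caption = processor.batch_decode(ids, skip_special_tokens=True)[0].strip()
    # Reward = cosine similarity between caption and prompt embeddings.
    emb = scorer.encode([caption, prompt], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```

The fine stage's re-weighting attention modulation is likewise described only at a high level; one generic realization, assumed here rather than taken from the paper, boosts the cross-attention weights of the text tokens covered by a local dense caption and renormalizes:

```python
def reweight_attention(weights: torch.Tensor, token_idx: list[int],
                       scale: float = 1.5) -> torch.Tensor:
    # weights: post-softmax cross-attention maps (batch, heads, query, key).
    # Emphasize the selected text tokens, then renormalize over the key axis.
    w = weights.clone()
    w[..., token_idx] *= scale
    return w / w.sum(dim=-1, keepdim=True)
```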
Related papers
- Generating Intermediate Representations for Compositional Text-To-Image Generation [16.757550214291015]
We propose a compositional approach for text-to-image generation based on two stages.
In the first stage, we design a diffusion-based generative model to produce one or more aligned intermediate representations conditioned on text.
In the second stage, we map these representations, together with the text, to the final output image using a separate diffusion-based generative model.
arXiv Detail & Related papers (2024-10-13T10:24:55Z)
- Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z)
- Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis [60.260724486834164]
This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries.
We present two key innovations: Vision Guidance and the Layered Rendering Diffusion framework.
We apply our method to three practical applications: bounding box-to-image, semantic mask-to-image and image editing.
arXiv Detail & Related papers (2023-11-30T10:36:19Z)
- Improving Compositional Text-to-image Generation with Large Vision-Language Models [26.202725136839632]
Compositional text-to-image models frequently encounter difficulties in generating high-quality images that align with the input texts.
We employ large vision-language models (LVLMs) for multi-dimensional assessment of the alignment between generated images and their corresponding input texts.
Our experimental results validate that the proposed methodology significantly improves text-image alignment in compositional image generation.
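The summary does not say how the LVLM scores alignment; a minimal stand-in, assuming a VQA-style yes/no query to an open-source vision-language model (BLIP-2 here, not necessarily the model the paper uses), could look like:

```python
# Hypothetical VQA-style alignment check; the paper's actual LVLM,
# prompts, and scoring dimensions are not given in this summary.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

proc = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
vlm = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

def alignment_check(image: Image.Image, question: str) -> bool:
    # e.g. question = "Is the cat sitting on a red chair?"
    inputs = proc(images=image, text=f"Question: {question} Answer:",
                  return_tensors="pt").to("cuda", torch.float16)
    out = vlm.generate(**inputs, max_new_tokens=5)
    answer = proc.batch_decode(out, skip_special_tokens=True)[0]
    return answer.strip().lower().startswith("yes")
```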
arXiv Detail & Related papers (2023-10-10T05:09:05Z)
- Variational Distribution Learning for Unsupervised Text-to-Image Generation [42.3246826401366]
We propose a text-to-image generation algorithm based on deep neural networks when text captions for images are unavailable during training.
We employ a pretrained CLIP model, which is capable of properly aligning embeddings of images and corresponding texts in a joint space.
We optimize a text-to-image generation model by maximizing the data log-likelihood conditioned on pairs of image-text CLIP embeddings.
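The CLIP step is concrete enough to sketch: embed image and text into CLIP's joint space and take their cosine similarity. The checkpoint below is illustrative; the paper's training objective builds on such embedding pairs rather than this exact scoring function.

```python
# Minimal CLIP image-text alignment score in the joint embedding space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment(image: Image.Image, text: str) -> float:
    inputs = clip_proc(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()  # cosine similarity
```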
arXiv Detail & Related papers (2023-03-28T16:18:56Z)
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architectures and the diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
- Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis has mainly followed the de facto standard of Generative Adversarial Nets (GANs).
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
- Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text-to-image generation (T2I) model aims to generate photo-realistic images that are semantically consistent with the text descriptions.
We propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments conducted on the MS-COCO dataset demonstrate the effectiveness of the proposed framework.
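The MTM's internals are not given in this summary; a plausible minimal form, assumed here rather than taken from the paper, is a learned projection from visual features into the language model's semantic space:

```python
# Assumed minimal Modality Transition Module; the paper's exact
# architecture and dimensions may differ.
import torch
import torch.nn as nn

class ModalityTransitionModule(nn.Module):
    def __init__(self, visual_dim: int = 2048, semantic_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, semantic_dim),
            nn.ReLU(),
            nn.LayerNorm(semantic_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # (batch, regions, visual_dim) -> (batch, regions, semantic_dim),
        # ready to be consumed by the language model.
        return self.proj(visual_feats)
```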
arXiv Detail & Related papers (2021-02-23T07:20:12Z)