Fine-grained Cross-modal Fusion based Refinement for Text-to-Image
Synthesis
- URL: http://arxiv.org/abs/2302.08706v2
- Date: Mon, 20 Feb 2023 09:38:50 GMT
- Title: Fine-grained Cross-modal Fusion based Refinement for Text-to-Image
Synthesis
- Authors: Haoran Sun, Yang Wang, Haipeng Liu, Biao Qian
- Abstract summary: We propose a novel Fine-grained text-image Fusion based Generative Adversarial Networks, dubbed FF-GAN.
The FF-GAN consists of two modules: Fine-grained text-image Fusion Block (FF-Block) and Global Semantic Refinement (GSR).
- Score: 12.954663420736782
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image synthesis refers to generating visually realistic and
semantically consistent images from given textual descriptions. Previous
approaches generate an initial low-resolution image and then refine it into a
high-resolution one. Despite the remarkable progress, these methods do not
fully exploit the given texts and can generate text-mismatched images,
especially when the text description is complex. We propose a novel
Fine-grained text-image Fusion based Generative Adversarial Networks, dubbed
FF-GAN, which consists of two modules: Fine-grained text-image Fusion Block
(FF-Block) and Global Semantic Refinement (GSR). The proposed FF-Block
integrates an attention block and several convolution layers to effectively
fuse the fine-grained word-context features into the corresponding visual
features, so that the text information is fully exploited to refine the
initial image with richer details. The GSR module improves the global semantic
consistency between linguistic and visual features during the refinement
process. Extensive experiments on the CUB-200 and COCO datasets demonstrate
the superiority of FF-GAN over other state-of-the-art approaches in generating
images that are semantically consistent with the given texts. Code is
available at https://github.com/haoranhfut/FF-GAN.
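The abstract describes the FF-Block as an attention block followed by several convolution layers that fuse word-level text features into the visual feature map, with GSR enforcing sentence-level consistency during refinement. Below is a minimal PyTorch-style sketch of that idea; the module names, dimensions, and the cosine-similarity form of the GSR objective are illustrative assumptions, not taken from the released FF-GAN code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFBlockSketch(nn.Module):
    """Hypothetical fine-grained fusion block: spatial image features attend
    over word features, and the attended text context is fused back into the
    visual features with a few convolution layers."""

    def __init__(self, img_dim=64, word_dim=256):
        super().__init__()
        self.query = nn.Conv2d(img_dim, word_dim, kernel_size=1)   # image -> query space
        self.fuse = nn.Sequential(                                  # conv fusion layers
            nn.Conv2d(img_dim + word_dim, img_dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(img_dim),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(img_dim, img_dim, kernel_size=3, padding=1),
        )

    def forward(self, img_feat, word_feat):
        # img_feat: (B, C, H, W) visual features; word_feat: (B, L, word_dim) word features
        b, c, h, w = img_feat.shape
        q = self.query(img_feat).flatten(2).transpose(1, 2)             # (B, HW, word_dim)
        attn = torch.softmax(q @ word_feat.transpose(1, 2), dim=-1)     # (B, HW, L) word attention
        context = (attn @ word_feat).transpose(1, 2).reshape(b, -1, h, w)  # (B, word_dim, H, W)
        return img_feat + self.fuse(torch.cat([img_feat, context], dim=1))


def gsr_loss_sketch(img_feat, sent_emb, proj):
    """Assumed form of a global semantic consistency objective: align the
    pooled image feature with the sentence embedding via cosine similarity."""
    global_img = proj(img_feat.mean(dim=(2, 3)))                     # (B, sent_dim)
    return 1.0 - F.cosine_similarity(global_img, sent_emb, dim=-1).mean()


# Toy usage: 64-channel 32x32 visual features, 18 words of dimension 256.
block = FFBlockSketch(img_dim=64, word_dim=256)
refined = block(torch.randn(2, 64, 32, 32), torch.randn(2, 18, 256))
loss = gsr_loss_sketch(refined, torch.randn(2, 256), nn.Linear(64, 256))
```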
Related papers
- Text-DiFuse: An Interactive Multi-Modal Image Fusion Framework based on Text-modulated Diffusion Model [30.739879255847946]
Existing multi-modal image fusion methods fail to address the compound degradations present in source images.
This study proposes a novel interactive multi-modal image fusion framework based on the text-modulated diffusion model, called Text-DiFuse.
arXiv Detail & Related papers (2024-10-31T13:10:50Z)
- TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method is free of extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z)
- Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion [26.809259323430368]
We introduce Text-IF, a novel approach that leverages a semantic text-guided image fusion model for the degradation-aware and interactive image fusion task.
Text-IF supports all-in-one degradation-aware processing of infrared and visible images and offers interactive, flexible fusion outcomes.
In this way, Text-IF achieves not only multi-modal image fusion, but also multi-modal information fusion.
arXiv Detail & Related papers (2024-03-25T03:06:45Z)
- Image Fusion via Vision-Language Model [91.36809431547128]
We introduce a novel fusion paradigm named image Fusion via vIsion-Language Model (FILM).
FILM generates semantic prompts from images and inputs them into ChatGPT for comprehensive textual descriptions.
These descriptions are fused within the textual domain and guide the visual information fusion.
FILM has shown promising results in four image fusion tasks: infrared-visible, medical, multi-exposure, and multi-focus image fusion.
arXiv Detail & Related papers (2024-02-03T18:36:39Z)
- RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment [112.45442468794658]
We propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff.
In the coarse semantic re-alignment phase, a novel caption reward is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt.
The fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view.
arXiv Detail & Related papers (2023-05-31T06:59:21Z)
- DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis [55.788772366325105]
We propose a Dynamic Aspect-awarE GAN (DAE-GAN) that represents text information comprehensively from multiple granularities, including sentence-level, word-level, and aspect-level.
Inspired by human learning behaviors, we develop a novel Aspect-aware Dynamic Re-drawer (ADR) for image refinement, in which an Attended Global Refinement (AGR) module and an Aspect-aware Local Refinement (ALR) module are alternately employed.
arXiv Detail & Related papers (2021-08-27T07:20:34Z)
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes (a generic sketch of this latent optimization appears after this list).
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
- Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions.
We propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler yet more efficient at synthesizing realistic and text-matching images.
arXiv Detail & Related papers (2020-08-13T12:51:17Z)
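The DF-GAN entry above names a Target-Aware Discriminator built from a Matching-Aware Gradient Penalty and a One-Way Output. Below is a hedged PyTorch sketch of what such a matching-aware gradient penalty typically looks like, penalizing the discriminator's gradient norm on real images paired with their matching sentence embeddings; the discriminator interface and the coefficients k and p are assumptions for illustration, not values taken from the DF-GAN code.

```python
import torch

def matching_aware_gradient_penalty(discriminator, real_imgs, sent_emb, k=2.0, p=6.0):
    """Sketch of a matching-aware gradient penalty: penalize the gradient norm
    of D(real image, matching sentence) w.r.t. both inputs, encouraging the
    discriminator to be smooth around real, text-matching pairs."""
    real_imgs = real_imgs.requires_grad_(True)
    sent_emb = sent_emb.requires_grad_(True)
    out = discriminator(real_imgs, sent_emb)          # (B,) conditional real/fake logits
    grads = torch.autograd.grad(
        outputs=out.sum(), inputs=(real_imgs, sent_emb), create_graph=True
    )
    grad_flat = torch.cat([g.reshape(g.size(0), -1) for g in grads], dim=1)
    return k * (grad_flat.norm(2, dim=1) ** p).mean()  # added to the discriminator loss
```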
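As referenced in the Cycle-Consistent Inverse GAN entry, that framework generates images with desired semantic attributes by optimizing inverted latent codes. The loop below is a generic sketch of such text-guided latent optimization; the generator and the text-image similarity function are placeholders, and the regularizer that keeps the code near its inversion is an assumption rather than the paper's exact objective.

```python
import torch

def optimize_latent_code(generator, text_image_similarity, z_init, text_emb,
                         steps=100, lr=0.05, lam=0.1):
    """Sketch of text-guided latent optimization: start from the inverted code
    z_init and nudge it so the generated image better matches the text, while
    staying close to the original inversion."""
    z = z_init.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        img = generator(z)                              # decode the current code
        loss = -text_image_similarity(img, text_emb)    # reward matching the text
        loss = loss + lam * (z - z_init).pow(2).mean()  # stay near the inverted code
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```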