Adma-GAN: Attribute-Driven Memory Augmented GANs for Text-to-Image
Generation
- URL: http://arxiv.org/abs/2209.14046v1
- Date: Wed, 28 Sep 2022 12:28:54 GMT
- Title: Adma-GAN: Attribute-Driven Memory Augmented GANs for Text-to-Image
Generation
- Authors: Xintian Wu, Hanbin Zhao, Liangli Zheng, Shouhong Ding, Xi Li
- Abstract summary: Text-to-image generation aims to generate photo-realistic and semantically consistent images according to the given text descriptions.
Existing methods mainly extract the text information from only one sentence to represent an image.
We propose an effective text representation method that complements the sentence with attribute information.
- Score: 18.36261166580862
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a challenging task, text-to-image generation aims to generate
photo-realistic and semantically consistent images according to the given text
descriptions. Existing methods mainly extract the text information from only
one sentence to represent an image, and this text representation strongly
affects the quality of the generated image. However, directly utilizing the
limited information in one sentence misses key attribute descriptions, which
are crucial for describing an image accurately. To alleviate this problem, we
propose an effective text representation method that complements the sentence
with attribute information. First, we construct an attribute memory to jointly
control text-to-image generation together with the sentence input. Second, we
explore two update mechanisms, sample-aware and sample-joint, to dynamically
optimize a generalized attribute memory. Furthermore, we design an
attribute-sentence-joint conditional generator learning scheme to align the
feature embeddings among multiple representations, which promotes cross-modal
network training. Experimental results illustrate that the proposed
method obtains substantial performance improvements on both the CUB (FID from
14.81 to 8.57) and COCO (FID from 21.42 to 12.39) datasets.
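The attribute memory described in the abstract can be pictured as a learnable bank of attribute embeddings that the sentence embedding attends over, with the read-out fused back into the condition fed to the generator. The PyTorch snippet below is a minimal sketch of that idea under assumed names and sizes (AttributeMemory, num_slots, dim, fuse); it is not the authors' implementation, and the paper's sample-aware/sample-joint update mechanisms and the attribute-sentence-joint learning scheme are not reproduced here.

```python
# Hypothetical sketch of attribute-memory-augmented conditioning; all module
# and parameter names are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttributeMemory(nn.Module):
    """Learnable bank of attribute embeddings queried by a sentence embedding."""

    def __init__(self, num_slots: int = 64, dim: int = 256):
        super().__init__()
        # Each row is one attribute slot; optimized jointly with the generator.
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)

    def forward(self, sent_emb: torch.Tensor) -> torch.Tensor:
        # Attention of the sentence embedding over the memory slots.
        attn = F.softmax(sent_emb @ self.slots.t() / sent_emb.shape[-1] ** 0.5, dim=-1)
        return attn @ self.slots  # (batch, dim) attribute read-out


class ConditionedGenerator(nn.Module):
    """Toy generator conditioned on the fused sentence + attribute code."""

    def __init__(self, dim: int = 256, noise_dim: int = 100):
        super().__init__()
        self.memory = AttributeMemory(dim=dim)
        self.fuse = nn.Linear(dim * 2, dim)
        self.net = nn.Sequential(
            nn.Linear(dim + noise_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * 64 * 64), nn.Tanh(),
        )

    def forward(self, sent_emb: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
        attr_emb = self.memory(sent_emb)                        # attribute complement
        cond = self.fuse(torch.cat([sent_emb, attr_emb], -1))   # joint condition
        img = self.net(torch.cat([cond, noise], -1))
        return img.view(-1, 3, 64, 64)


if __name__ == "__main__":
    g = ConditionedGenerator()
    fake = g(torch.randn(4, 256), torch.randn(4, 100))
    print(fake.shape)  # torch.Size([4, 3, 64, 64])
```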
Related papers
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a multimodal data augmentation method based on knowledge-guided manipulation of visual attributes.
It extracts knowledge-grounded attributes from symbolic KBs to generate semantically consistent yet distinctive image-text pairs.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z) - Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing [47.421888361871254]
Scene text images contain not only style information (font, background) but also content information (character, texture).
Previous representation learning methods use tightly coupled features for all tasks, resulting in sub-optimal performance.
We propose a Disentangled Representation Learning framework (DARLING) aimed at disentangling these two types of features for improved adaptability.
arXiv Detail & Related papers (2024-05-07T15:00:11Z) - Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image
Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
arXiv Detail & Related papers (2023-12-14T12:39:29Z) - Improving Generalization of Image Captioning with Unsupervised Prompt
Learning [63.26197177542422]
Generalization of Image Captioning (GeneIC) learns a domain-specific prompt vector for the target domain without requiring annotated data.
GeneIC aligns visual and language modalities with a pre-trained Contrastive Language-Image Pre-Training (CLIP) model.
arXiv Detail & Related papers (2023-08-05T12:27:01Z) - Self-supervised Character-to-Character Distillation for Text Recognition [54.12490492265583]
We propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate text representation learning.
CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution.
arXiv Detail & Related papers (2022-11-01T05:48:18Z) - Memory-Driven Text-to-Image Generation [126.58244124144827]
We introduce a memory-driven semi-parametric approach to text-to-image generation.
The non-parametric component is a memory bank of image features constructed from a training set of images; the parametric component is a generative adversarial network (a minimal retrieval sketch follows this list).
arXiv Detail & Related papers (2022-08-15T06:32:57Z) - Image Captioning based on Feature Refinement and Reflective Decoding [0.0]
This paper introduces an encoder-decoder-based image captioning system.
It extracts spatial and global features for each region in the image using Faster R-CNN with ResNet-101 as a backbone.
The decoder consists of an attention-based recurrent module and a reflective attention module to enhance the decoder's ability to model long-term sequential dependencies.
arXiv Detail & Related papers (2022-06-16T07:56:28Z) - Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy.
We propose a novel text-to-image method that addresses these gaps by enabling a simple control mechanism complementary to text in the form of a scene.
Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z) - DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis [55.788772366325105]
We propose a Dynamic Aspect-awarE GAN (DAE-GAN) that represents text information comprehensively from multiple granularities, including sentence-level, word-level, and aspect-level.
Inspired by human learning behaviors, we develop a novel Aspect-aware Dynamic Re-drawer (ADR) for image refinement, in which an Attended Global Refinement (AGR) module and an Aspect-aware Local Refinement (ALR) module are alternately employed.
arXiv Detail & Related papers (2021-08-27T07:20:34Z) - Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text-to-image generation (T2I) model aims to generate photo-realistic images that are semantically consistent with the text descriptions.
We propose a novel framework, Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)
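As a rough illustration of the semi-parametric setup mentioned in the Memory-Driven Text-to-Image Generation entry above, the sketch below retrieves the top-k nearest image features from a fixed memory bank for a given query embedding; the retrieved features would then condition the parametric GAN. This is an assumed, simplified reading of that paper, with hypothetical names (retrieve, bank, k), not its actual code.

```python
# Hypothetical sketch: query a non-parametric memory bank of image features;
# the bank construction and the parametric GAN itself are stubbed out.
import torch
import torch.nn.functional as F


def retrieve(query: torch.Tensor, bank: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Return the top-k most similar bank features for each query (cosine similarity)."""
    sim = F.normalize(query, dim=-1) @ F.normalize(bank, dim=-1).t()  # (B, N)
    idx = sim.topk(k, dim=-1).indices                                 # (B, k)
    return bank[idx]                                                  # (B, k, D)


if __name__ == "__main__":
    bank = torch.randn(10_000, 256)   # image features from the training set
    text_emb = torch.randn(2, 256)    # query embeddings for two captions
    retrieved = retrieve(text_emb, bank)
    # `retrieved` would be fed, together with noise, to the parametric generator.
    print(retrieved.shape)            # torch.Size([2, 4, 256])
```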
This list is automatically generated from the titles and abstracts of the papers in this site.