Fine-grained Text to Image Synthesis
- URL: http://arxiv.org/abs/2412.07196v2
- Date: Sun, 15 Dec 2024 22:56:40 GMT
- Title: Fine-grained Text to Image Synthesis
- Authors: Xu Ouyang, Ying Chen, Kaiyue Zhu, Gady Agam
- Abstract summary: Fine-grained text to image synthesis involves generating images from texts that belong to different categories.
An auxiliary classifier helps the discriminator classify the class of images, and helps the generator synthesize more accurate fine-grained images.
We evaluate against several state-of-the-art methods on the commonly used CUB-200-2011 bird dataset and Oxford-102 flower dataset.
- Score: 4.742847798631037
- License:
- Abstract: Fine-grained text to image synthesis involves generating images from texts that belong to different categories. In contrast to general text to image synthesis, in fine-grained synthesis there is high similarity between images of different subclasses, and there may be linguistic discrepancies among texts describing the same image. Recent Generative Adversarial Networks (GANs), such as the Recurrent Affine Transformation (RAT) GAN model, are able to synthesize clear and realistic images from texts. However, these GAN models ignore fine-grained information. In this paper we propose an approach that incorporates an auxiliary classifier in the discriminator and a contrastive learning method to improve the accuracy of fine-grained details in images synthesized by RAT GAN. The auxiliary classifier helps the discriminator classify the class of images, and helps the generator synthesize more accurate fine-grained images. The contrastive learning method minimizes the similarity between images from different subclasses and maximizes the similarity between images from the same subclass. We evaluate against several state-of-the-art methods on the commonly used CUB-200-2011 bird dataset and Oxford-102 flower dataset, and demonstrate superior performance.
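The two training signals named in the abstract can be illustrated with a small sketch. The abstract does not give the exact loss formulation, so what follows is an assumption: a supervised-contrastive-style term over cosine similarities with a temperature, and a plain cross-entropy term for the auxiliary classifier head. The function names, the `temperature` parameter, and the toy feature vectors are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_contrastive_loss(features, labels, temperature=0.1):
    """Assumed supervised-contrastive-style loss (not the paper's exact form).

    It is low when same-subclass features are similar and
    different-subclass features are dissimilar. Anchors without any
    same-class partner in the batch are skipped.
    """
    n = len(features)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        logits = {
            j: cosine(features[i], features[j]) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all other samples.
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


def auxiliary_class_loss(class_logits, label):
    """Cross-entropy of a hypothetical auxiliary classifier head for one image."""
    peak = max(class_logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in class_logits))
    return log_norm - class_logits[label]


# Toy image features: two tight clusters in 2-D.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_when_clusters_match_classes = class_contrastive_loss(toy_features, [0, 0, 1, 1])
loss_when_classes_are_mixed = class_contrastive_loss(toy_features, [0, 1, 0, 1])
```

With the toy features, the contrastive term is smaller when the class labels agree with the clusters than when they cut across them, which is the behaviour the abstract describes. In a real model both terms would presumably be weighted and added to the adversarial loss; those weights are not stated in the abstract.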
Related papers
- DXAI: Explaining Classification by Image Decomposition [4.013156524547072]
We propose a new way to visualize neural network classification through a decomposition-based explainable AI (DXAI).
Instead of providing an explanation heatmap, our method yields a decomposition of the image into class-agnostic and class-distinct parts.
arXiv Detail & Related papers (2023-12-30T20:52:20Z)
- Learning Vision from Models Rivals Learning Vision from Data [54.43596959598465]
We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions.
We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption.
We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs.
arXiv Detail & Related papers (2023-12-28T18:59:55Z)
- Diversified in-domain synthesis with efficient fine-tuning for few-shot classification [64.86872227580866]
Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class.
We propose DISEF, a novel approach which addresses the generalization challenge in few-shot learning using synthetic data.
We validate our method in ten different benchmarks, consistently outperforming baselines and establishing a new state-of-the-art for few-shot classification.
arXiv Detail & Related papers (2023-12-05T17:18:09Z)
- Recurrent Affine Transformation for Text-to-image Synthesis [5.256132101498471]
Existing methods usually adaptively fuse suitable text information into the synthesis process with isolated fusion blocks.
We propose a Recurrent Affine Transformation (RAT) for Generative Adversarial Networks that connects all the fusion blocks with a recurrent neural network to model their long-term dependency.
Being aware of matching image regions, text descriptions supervise the generator to synthesize more relevant image contents.
arXiv Detail & Related papers (2022-04-22T03:49:47Z)
- StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis [52.341186561026724]
Lacking compositionality could have severe implications for robustness and fairness.
We introduce a new framework, StyleT2I, to improve the compositionality of text-to-image synthesis.
Results show that StyleT2I outperforms previous approaches in terms of consistency between the input text and synthesized images.
arXiv Detail & Related papers (2022-03-29T17:59:50Z)
- More Control for Free! Image Synthesis with Semantic Diffusion Guidance [79.88929906247695]
Controllable image synthesis models allow creation of diverse images based on text instructions or guidance from an example image.
We introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both.
We conduct experiments on FFHQ and LSUN datasets, and show results on fine-grained text-guided image synthesis.
arXiv Detail & Related papers (2021-12-10T18:55:50Z)
- DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis [55.788772366325105]
We propose a Dynamic Aspect-awarE GAN (DAE-GAN) that represents text information comprehensively from multiple granularities, including sentence-level, word-level, and aspect-level.
Inspired by human learning behaviors, we develop a novel Aspect-aware Dynamic Re-drawer (ADR) for image refinement, in which an Attended Global Refinement (AGR) module and an Aspect-aware Local Refinement (ALR) module are alternately employed.
arXiv Detail & Related papers (2021-08-27T07:20:34Z)
- Improving Text-to-Image Synthesis Using Contrastive Learning [4.850820365312369]
We propose a contrastive learning approach to improve the quality and enhance the semantic consistency of synthetic images.
We evaluate our approach on two popular text-to-image synthesis models, AttnGAN and DM-GAN, using the CUB and COCO datasets.
arXiv Detail & Related papers (2021-07-06T06:43:31Z)
- Synthesize-It-Classifier: Learning a Generative Classifier through Recurrent Self-analysis [9.029985847202667]
We show the generative capability of an image classifier network by synthesizing high-resolution, photo-realistic, and diverse images at scale.
The overall methodology, called Synthesize-It-Classifier (STIC), does not require an explicit generator network to estimate the density of the data distribution.
We demonstrate an Attentive-STIC network that shows an iterative drawing of synthesized images on the ImageNet dataset.
arXiv Detail & Related papers (2021-03-26T02:00:29Z)
- Image-to-Image Translation with Text Guidance [139.41321867508722]
The goal of this paper is to embed controllable factors, i.e., natural language descriptions, into image-to-image translation with generative adversarial networks.
We propose four key components, including: (1) the implementation of part-of-speech tagging to filter out non-semantic words in the given description, (2) the adoption of an affine combination module to effectively fuse different modality text and image features, and (3) a novel refined multi-stage architecture to strengthen the differential ability of discriminators and the rectification ability of generators.
arXiv Detail & Related papers (2020-02-12T21:09:15Z)