Improving Text to Image Generation using Mode-seeking Function
- URL: http://arxiv.org/abs/2008.08976v4
- Date: Fri, 18 Sep 2020 20:00:56 GMT
- Title: Improving Text to Image Generation using Mode-seeking Function
- Authors: Naitik Bhise, Zhenfei Zhang, Tien D. Bui
- Abstract summary: We develop a specialized mode-seeking loss function for the generation of distinct images.
We validate our model on the Caltech Birds dataset and the Microsoft COCO dataset.
Experimental results demonstrate that our model performs favorably against several state-of-the-art approaches.
- Score: 5.92166950884028
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative Adversarial Networks (GANs) have long been used to model the
semantic relationship between text and images. However, mode collapse during image
generation causes the network to favor a few preferred output modes. Our aim is to
improve the training of the network with a specialized mode-seeking loss function
that avoids this issue. In text-to-image synthesis, our loss function pushes apart
the images generated from two different points in latent space, encouraging distinct
outputs. We validate our model on the Caltech Birds (CUB) dataset and the Microsoft
COCO dataset while varying the weight (intensity) of the loss function during
training. Experimental results demonstrate that our model performs favorably against
several state-of-the-art approaches.
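For a concrete picture of the idea, below is a minimal PyTorch sketch of a mode-seeking regularizer in the spirit described above: two latent codes conditioned on the same text should map to distinct images, so the ratio of latent-space distance to image-space distance is penalized. The `generator` callable, its signature, and the weighting constant are illustrative assumptions, not the paper's exact implementation.
```python
import torch

def mode_seeking_loss(generator, text_embedding, z_dim=100, eps=1e-5):
    """Minimal sketch of a mode-seeking regularizer.

    Two latent codes conditioned on the same text should yield distinct
    images: we encourage a large image distance relative to the latent
    distance by minimizing the reciprocal ratio.
    `generator` and `text_embedding` are hypothetical placeholders.
    """
    batch = text_embedding.size(0)
    z1 = torch.randn(batch, z_dim, device=text_embedding.device)
    z2 = torch.randn(batch, z_dim, device=text_embedding.device)

    img1 = generator(z1, text_embedding)   # assumed signature: G(noise, text)
    img2 = generator(z2, text_embedding)

    d_img = torch.mean(torch.abs(img1 - img2), dim=[1, 2, 3])  # L1 image distance
    d_z = torch.mean(torch.abs(z1 - z2), dim=1)                # L1 latent distance

    # A small ratio means two latent codes collapsed onto similar images.
    return torch.mean(d_z / (d_img + eps))

# Typical use inside the generator update; lambda_ms plays the role of the
# loss "intensity" the abstract mentions varying during training:
# g_loss = adversarial_loss + lambda_ms * mode_seeking_loss(G, txt_emb)
```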
Related papers
- Multi-Scale Texture Loss for CT denoising with GANs [0.9349653765341301]
Generative Adversarial Networks (GANs) have proved to be a powerful framework for denoising applications in medical imaging.
This work presents a loss function that leverages the intrinsic multi-scale nature of the Gray-Level Co-occurrence Matrix (GLCM).
Our approach also introduces a self-attention layer that dynamically aggregates the multi-scale texture information extracted from the images.
arXiv Detail & Related papers (2024-03-25T11:28:52Z)
- The Right Losses for the Right Gains: Improving the Semantic Consistency of Deep Text-to-Image Generation with Distribution-Sensitive Losses [0.35898124827270983]
We propose a contrastive learning approach with a novel combination of two loss functions: fake-to-fake loss and fake-to-real loss.
We test this approach on two baseline models: SSAGAN and AttnGAN.
Results show that our approach improves the qualitative results on AttnGAN with style blocks on the CUB dataset.
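As a rough illustration of what such contrastive terms could look like, here is a hedged PyTorch sketch: the fake-to-real term pulls a generated image toward the real image of the same caption, and the fake-to-fake term treats two generations from the same caption as positives. This is one plausible reading of the loss names, not the authors' exact formulation; the feature encoders, temperature, and batch pairing are assumptions.
```python
import torch
import torch.nn.functional as F

def fake_to_real_loss(fake_feat, real_feat, temperature=0.1):
    """InfoNCE-style term: a generated image should be closest to the real
    image of the same caption among all real images in the batch.
    A plausible sketch only; the paper's exact formulation may differ."""
    fake_feat = F.normalize(fake_feat, dim=-1)
    real_feat = F.normalize(real_feat, dim=-1)
    logits = fake_feat @ real_feat.t() / temperature          # (B, B) similarities
    targets = torch.arange(fake_feat.size(0), device=fake_feat.device)
    return F.cross_entropy(logits, targets)

def fake_to_fake_loss(fake_feat_a, fake_feat_b, temperature=0.1):
    """Two generations from the same caption (different noise) are treated
    as positives, generations from other captions as negatives."""
    a = F.normalize(fake_feat_a, dim=-1)
    b = F.normalize(fake_feat_b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)
```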
arXiv Detail & Related papers (2023-12-18T00:05:28Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
arXiv Detail & Related papers (2023-10-11T12:46:11Z)
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic captions.
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
- Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z)
- Intra-Modal Constraint Loss For Image-Text Retrieval [10.496611712280972]
Cross-modal retrieval has drawn much attention in computer vision and natural language processing domains.
With the development of convolutional and recurrent neural networks, the bottleneck of retrieval across image-text modalities is no longer the extraction of image and text features but the learning of an efficient loss function in the embedding space.
This paper proposes a method for learning joint embedding of images and texts using an intra-modal constraint loss function to reduce the violation of negative pairs from the same homogeneous modality.
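A minimal sketch of what an intra-modal constraint can look like in code, assuming L2-normalized image and text embeddings in a joint space: negatives from the same modality are not allowed to score higher than the matched cross-modal pair by more than a margin. The margin value and the hinge form are illustrative; the paper's exact terms and weighting may differ.
```python
import torch
import torch.nn.functional as F

def intra_modal_constraint_loss(img_emb, txt_emb, margin=0.2):
    """Sketch of an intra-modal constraint: besides the usual cross-modal
    ranking terms, negatives drawn from the same modality must not score
    higher than the matched image-text pair minus a margin.
    Illustrative only; not the paper's exact formulation."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    pos = torch.sum(img_emb * txt_emb, dim=-1, keepdim=True)   # matched pairs, (B, 1)

    sim_ii = img_emb @ img_emb.t()                              # image-image similarities
    sim_tt = txt_emb @ txt_emb.t()                              # text-text similarities
    eye = torch.eye(img_emb.size(0), device=img_emb.device, dtype=torch.bool)

    # Hinge on intra-modal negatives relative to the cross-modal positive;
    # the diagonal (self-similarity) is masked out with a large negative value.
    loss_ii = F.relu(margin + sim_ii.masked_fill(eye, -1e4) - pos).mean()
    loss_tt = F.relu(margin + sim_tt.masked_fill(eye, -1e4) - pos).mean()
    return loss_ii + loss_tt
```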
arXiv Detail & Related papers (2022-07-11T17:21:25Z)
- Image Restoration by Deep Projected GSURE [115.57142046076164]
Ill-posed inverse problems appear in many image processing applications, such as deblurring and super-resolution.
We propose a new image restoration framework based on minimizing a loss function that includes a "projected version" of the Generalized Stein Unbiased Risk Estimator (GSURE) and a parameterization of the latent image by a CNN.
arXiv Detail & Related papers (2021-02-04T08:52:46Z)
- A Loss Function for Generative Neural Networks Based on Watson's Perceptual Model [14.1081872409308]
Training Variational Autoencoders (VAEs) to generate realistic imagery requires a loss function that reflects human perception of image similarity.
We propose such a loss function based on Watson's perceptual model, which computes a weighted distance in frequency space and accounts for luminance and contrast masking.
In experiments, VAEs trained with the new loss function generated realistic, high-quality image samples.
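To illustrate the general idea of a frequency-space perceptual distance, the sketch below weights spectral errors so that high-frequency differences count less. It is a deliberate simplification: Watson's model operates on block DCT coefficients with luminance and contrast masking, which this FFT-based toy version omits, and the weighting function here is an assumption.
```python
import torch

def frequency_weighted_distance(x, y, alpha=1.0):
    """Greatly simplified sketch of a perceptual distance in frequency space.
    Errors are weighted by spatial frequency, down-weighting high frequencies
    the eye is less sensitive to. Not the authors' Watson-model loss.
    x, y: image batches of shape (B, C, H, W) in [0, 1]."""
    X = torch.fft.rfft2(x, norm="ortho")
    Y = torch.fft.rfft2(y, norm="ortho")

    h, w = X.shape[-2], X.shape[-1]
    fy = torch.fft.fftfreq(x.shape[-2], device=x.device).abs().view(h, 1)
    fx = torch.fft.rfftfreq(x.shape[-1], device=x.device).view(1, w)
    weight = 1.0 / (1.0 + alpha * torch.sqrt(fx ** 2 + fy ** 2))  # lower weight at high frequencies

    return torch.mean(weight * (X - Y).abs() ** 2)
```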
arXiv Detail & Related papers (2020-06-26T15:36:11Z)
- Image Fine-grained Inpainting [89.17316318927621]
We present a one-stage model that utilizes dense combinations of dilated convolutions to obtain larger and more effective receptive fields.
To better train this efficient generator, in addition to the frequently used VGG feature matching loss, we design a novel self-guided regression loss.
We also employ a discriminator with local and global branches to ensure local-global contents consistency.
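The sketch below shows one way a densely connected stack of dilated convolutions can enlarge the receptive field without downsampling; the channel counts, dilation rates, and activation are illustrative and not the paper's exact configuration.
```python
import torch
import torch.nn as nn

class DenseDilatedBlock(nn.Module):
    """Sketch of a densely connected stack of dilated convolutions.
    Each layer sees the concatenation of the input and all previous outputs,
    and the growing dilation rates enlarge the receptive field.
    Hyperparameters are illustrative, not the paper's configuration."""
    def __init__(self, channels=64, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.layers = nn.ModuleList()
        in_ch = channels
        for d in dilations:
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch, channels, kernel_size=3, padding=d, dilation=d),
                nn.ELU(inplace=True),
            ))
            in_ch += channels   # dense connectivity: input channels accumulate

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return features[-1]
```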
arXiv Detail & Related papers (2020-02-07T03:45:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.