The Right Losses for the Right Gains: Improving the Semantic Consistency
of Deep Text-to-Image Generation with Distribution-Sensitive Losses
- URL: http://arxiv.org/abs/2312.10854v1
- Date: Mon, 18 Dec 2023 00:05:28 GMT
- Title: The Right Losses for the Right Gains: Improving the Semantic Consistency
of Deep Text-to-Image Generation with Distribution-Sensitive Losses
- Authors: Mahmoud Ahmed, Omer Moussa, Ismail Shaheen, Mohamed Abdelfattah, Amr
Abdalla, Marwan Eid, Hesham Eraqi, Mohamed Moustafa
- Abstract summary: We propose a contrastive learning approach with a novel combination of two loss functions: fake-to-fake loss and fake-to-real loss.
We test this approach on two baseline models: SSAGAN and AttnGAN.
Results show that our approach improves the qualitative results on AttnGAN with style blocks on the CUB dataset.
- Score: 0.35898124827270983
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: One of the major challenges in training deep neural networks for
text-to-image generation is the significant linguistic discrepancy between
ground-truth captions of each image in most popular datasets. The large
difference in the choice of words in such captions results in synthesizing
images that are semantically dissimilar to each other and to their ground-truth
counterparts. Moreover, existing models either fail to generate the
fine-grained details of the image or require a huge number of parameters, which
renders them inefficient for text-to-image synthesis. To fill this gap in the
literature, we propose using the contrastive learning approach with a novel
combination of two loss functions: fake-to-fake loss to increase the semantic
consistency between generated images of the same caption, and fake-to-real loss
to reduce the gap between the distributions of real images and fake ones. We
test this approach on two baseline models: SSAGAN and AttnGAN (with style
blocks to enhance the fine-grained details of the images). Results show that
our approach improves the qualitative results on AttnGAN with style blocks on
the CUB dataset. Additionally, on the challenging COCO dataset, our approach
achieves competitive results against the state-of-the-art Lafite model and
outperforms the FID score of the SSAGAN model by 44.
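For intuition, the sketch below shows one way the two losses described in the abstract could be instantiated, assuming image embeddings from a shared encoder and an in-batch InfoNCE formulation; the function names, weighting hyperparameters, and the exact contrastive form are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (assumed) of combining a fake-to-fake term that pulls together
# images generated from the same caption with a fake-to-real term that pulls
# generated images toward their ground-truth counterparts.
import torch
import torch.nn.functional as F


def info_nce(anchors: torch.Tensor, targets: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """In-batch InfoNCE: row i of `targets` is the positive for row i of
    `anchors`; every other row in the batch serves as a negative."""
    anchors = F.normalize(anchors, dim=-1)
    targets = F.normalize(targets, dim=-1)
    logits = anchors @ targets.t() / temperature          # (B, B) cosine similarities
    labels = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, labels)


def semantic_consistency_loss(fake_a: torch.Tensor,
                              fake_b: torch.Tensor,
                              real: torch.Tensor,
                              lambda_ff: float = 1.0,
                              lambda_fr: float = 1.0) -> torch.Tensor:
    """fake_a, fake_b: embeddings of two images synthesized from the same caption.
    real: embeddings of the matching ground-truth images.
    lambda_ff / lambda_fr are hypothetical weighting hyperparameters."""
    loss_ff = info_nce(fake_a, fake_b)   # fake-to-fake: same-caption consistency
    loss_fr = info_nce(fake_a, real)     # fake-to-real: close the real/fake gap
    return lambda_ff * loss_ff + lambda_fr * loss_fr
```

In this reading, the combined term would be added to the usual adversarial objective of the generator; the actual weighting and embedding network used in the paper may differ.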
Related papers
- Enhanced Unsupervised Image-to-Image Translation Using Contrastive Learning and Histogram of Oriented Gradients [0.0]
This paper proposes an enhanced unsupervised image-to-image translation method based on the Contrastive Unpaired Translation (CUT) model.
This novel approach ensures the preservation of the semantic structure of images, even without semantic labels.
The method was tested on translating synthetic game environments from the GTA5 dataset to realistic urban scenes in the Cityscapes dataset.
arXiv Detail & Related papers (2024-09-24T12:44:27Z)
- Improving Text Generation on Images with Synthetic Captions [2.1175632266708733]
Latent diffusion models such as SDXL and SD 1.5 have shown significant capability in generating realistic images.
We propose a low-cost approach by leveraging SDXL without any time-consuming training on large-scale datasets.
Our results demonstrate how our small scale fine-tuning approach can improve the accuracy of text generation in different scenarios.
arXiv Detail & Related papers (2024-06-01T17:27:34Z)
- Text Data-Centric Image Captioning with Interactive Prompts [20.48013600818985]
Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data.
This paper proposes a new Text data-centric approach with Interactive Prompts for image Captioning, named TIPCap.
arXiv Detail & Related papers (2024-03-28T07:43:49Z)
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose an Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic caption.
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
- Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language.
We pioneer a systematic study on the detection of deepfakes generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z)
- Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto Generative Adversarial Nets (GANs).
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- Improving Text-to-Image Synthesis Using Contrastive Learning [4.850820365312369]
We propose a contrastive learning approach to improve the quality and enhance the semantic consistency of synthetic images.
We evaluate our approach on two popular text-to-image synthesis models, AttnGAN and DM-GAN, on the CUB and COCO datasets.
arXiv Detail & Related papers (2021-07-06T06:43:31Z)
- Improving Text to Image Generation using Mode-seeking Function [5.92166950884028]
We develop a specialized mode-seeking loss function for the generation of distinct images.
We validate our model on the Caltech Birds dataset and the Microsoft COCO dataset.
Experimental results demonstrate that our model performs favorably compared to several state-of-the-art approaches.
arXiv Detail & Related papers (2020-08-19T12:58:32Z)
- Deep Semantic Matching with Foreground Detection and Cycle-Consistency [103.22976097225457]
We address weakly supervised semantic matching based on a deep network.
We explicitly estimate the foreground regions to suppress the effect of background clutter.
We develop cycle-consistent losses to enforce the predicted transformations across multiple images to be geometrically plausible and consistent.
arXiv Detail & Related papers (2020-03-31T22:38:09Z)
- Image Fine-grained Inpainting [89.17316318927621]
We present a one-stage model that utilizes dense combinations of dilated convolutions to obtain larger and more effective receptive fields.
To better train this efficient generator, in addition to the frequently-used VGG feature matching loss, we design a novel self-guided regression loss.
We also employ a discriminator with local and global branches to ensure local-global contents consistency.
arXiv Detail & Related papers (2020-02-07T03:45:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.