Text-to-Image Generation via Implicit Visual Guidance and Hypernetwork
- URL: http://arxiv.org/abs/2208.08493v1
- Date: Wed, 17 Aug 2022 19:25:00 GMT
- Title: Text-to-Image Generation via Implicit Visual Guidance and Hypernetwork
- Authors: Xin Yuan, Zhe Lin, Jason Kuen, Jianming Zhang, John Collomosse
- Abstract summary: We develop an approach for text-to-image generation that incorporates additional retrieved images, driven by a combination of an implicit visual guidance loss and generative objectives.
We propose a novel hypernetwork-modulated visual-text encoding scheme to predict the weight update of the encoding layer.
Experimental results show that our model, guided by additional retrieved visual data, outperforms existing GAN-based models.
- Score: 38.55086153299993
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We develop an approach for text-to-image generation that incorporates
additional retrieved images, driven by a combination of an implicit visual
guidance loss and generative objectives. Unlike most existing text-to-image
generation methods, which take only the text as input, our method dynamically
feeds cross-modal search results into a unified training stage, improving the
quality, controllability, and diversity of the generated results. We propose a
novel hypernetwork-modulated visual-text encoding scheme to predict the weight
update of the encoding layer, enabling effective transfer of visual information
(e.g. layout, content) into the corresponding latent domain. Experimental
results show that our model, guided by additional retrieved visual data,
outperforms existing GAN-based models. On the COCO dataset, we achieve a better
FID of $9.13$ with up to $3.5 \times$ fewer generator parameters than the
state-of-the-art method.
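The abstract describes the hypernetwork-modulated encoding only at a high level. The sketch below is a rough illustration of the idea, not the authors' implementation: a hypernetwork maps pooled features of the retrieved images to a weight update for the text-encoding layer, so layout and content from retrieval can steer the text latent. The low-rank factorization, class name, and all dimensions are assumptions made for brevity.

```python
# Minimal sketch of a hypernetwork-modulated visual-text encoding layer
# (assumed low-rank form; not the authors' released code). A hypernetwork
# predicts a weight update delta_W for the text-encoding layer from
# retrieval-image features. All dimensions are illustrative.
import torch
import torch.nn as nn


class HyperModulatedEncoder(nn.Module):
    def __init__(self, text_dim=256, latent_dim=256, visual_dim=512, rank=16):
        super().__init__()
        self.base = nn.Linear(text_dim, latent_dim)  # encoding layer: W, b
        # Hypernetwork predicting delta_W = A @ B from visual features.
        self.to_a = nn.Linear(visual_dim, latent_dim * rank)
        self.to_b = nn.Linear(visual_dim, rank * text_dim)
        self.rank = rank

    def forward(self, text_feat, retrieved_feat):
        # text_feat: (B, text_dim), retrieved_feat: (B, visual_dim)
        bsz = text_feat.size(0)
        a = self.to_a(retrieved_feat).view(bsz, -1, self.rank)  # (B, latent, r)
        b = self.to_b(retrieved_feat).view(bsz, self.rank, -1)  # (B, r, text)
        delta_w = torch.bmm(a, b)                                # (B, latent, text)
        # Apply (W + delta_W) x + b, with delta_W computed per sample.
        out = self.base(text_feat)
        out = out + torch.bmm(delta_w, text_feat.unsqueeze(-1)).squeeze(-1)
        return out


if __name__ == "__main__":
    enc = HyperModulatedEncoder()
    z = enc(torch.randn(4, 256), torch.randn(4, 512))
    print(z.shape)  # torch.Size([4, 256])
```

In the paper's setting, such a modulated encoder would be trained jointly with the implicit visual guidance loss and the usual generative (adversarial) objectives; the exact loss weighting is not specified in the abstract.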
Related papers
- Weak Supervision Dynamic KL-Weighted Diffusion Models Guided by Large Language Models [0.0]
We present a novel method for improving text-to-image generation by combining Large Language Models with diffusion models.
Our approach incorporates semantic understanding from pre-trained LLMs to guide the generation process.
Our method significantly improves both the visual quality and alignment of generated images with text descriptions.
arXiv Detail & Related papers (2025-02-02T15:43:13Z)
- Generating Multimodal Images with GAN: Integrating Text, Image, and Style [7.481665175881685]
We propose a multimodal image generation method based on Generative Adversarial Networks (GANs).
This method involves the design of a text encoder, an image feature extractor, and a style integration module.
Experimental results show that our method produces images with high clarity and consistency across multiple public datasets.
arXiv Detail & Related papers (2025-01-04T02:51:28Z)
- Dataset Augmentation by Mixing Visual Concepts [3.5420134832331334]
This paper proposes a dataset augmentation method by fine-tuning pre-trained diffusion models.
We adapt the diffusion model by conditioning it with real images and novel text embeddings.
Our approach outperforms state-of-the-art augmentation techniques on benchmark classification tasks.
arXiv Detail & Related papers (2024-12-19T19:42:22Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval [142.047662926209]
We propose a novel framework for paired data augmentation by uncovering the hidden semantic information of the StyleGAN2 model.
We generate augmented text through random token replacement, then pass the augmented text into the latent space alignment module.
We evaluate the efficacy of our augmented data approach on two public cross-modal retrieval datasets.
arXiv Detail & Related papers (2022-07-29T01:21:54Z)
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes (a generic sketch of this optimization step appears after this list).
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
It consists of a StyleGAN inversion module that maps real images to the latent space of a well-trained StyleGAN,
visual-linguistic similarity learning that matches text and image by mapping them into a common embedding space,
and instance-level optimization for identity preservation during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
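Several of the entries above (notably Cycle-Consistent Inverse GAN and TediGAN) share the same two-step recipe: invert an image into a GAN latent code, then optimize that code under a text-matching score. The sketch below is a generic, hedged illustration of that optimization step, not the released code of either paper; the generator and similarity model are placeholders, and the prior weight, learning rate, and step count are arbitrary assumptions.

```python
# Generic text-guided latent-code optimization (illustrative only).
# `generator` and `similarity` stand in for a pretrained GAN generator
# and a text-image matching model.
import torch


def optimize_latent(generator, similarity, z_init, text_emb, steps=200, lr=0.05):
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        img = generator(z)
        # Maximize text-image similarity; the L2 term keeps z close to the
        # inverted code so the source image's identity/content is preserved.
        loss = -similarity(img, text_emb) + 0.1 * (z - z_init).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()


if __name__ == "__main__":
    # Toy stand-ins, just to show the calling convention.
    gen = torch.nn.Linear(64, 3 * 32 * 32)
    sim = lambda img, txt: img.mean() + txt.mean()
    z_star = optimize_latent(gen, sim, torch.randn(1, 64), torch.randn(1, 16), steps=10)
    print(z_star.shape)  # torch.Size([1, 64])
```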