Related papers: Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer

Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer

URL: http://arxiv.org/abs/2204.07537v1
Date: Fri, 15 Apr 2022 16:29:55 GMT
Title: Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer
Authors: Hyungyung Lee, Sungjin Park, Edward Choi
Abstract summary: We propose MXQ-VAE, a vector quantization method for multimodal image-text representation. MXQ-VAE accepts a paired image and text as input, and learns a joint quantized representation space. We can use autoregressive generative models to model the joint image-text representation, and even perform unconditional image-text pair generation.
Score: 8.069590683507997
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Though deep generative models have gained a lot of attention, most of the existing works are designed for the unimodal generation task. In this paper, we explore a new method for unconditional image-text pair generation. We propose MXQ-VAE, a vector quantization method for multimodal image-text representation. MXQ-VAE accepts a paired image and text as input, and learns a joint quantized representation space, so that the image-text pair can be converted to a sequence of unified indices. Then we can use autoregressive generative models to model the joint image-text representation, and even perform unconditional image-text pair generation. Extensive experimental results demonstrate that our approach effectively generates semantically consistent image-text pair and also enhances meaningful alignment between image and text.

Related papers

Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think [38.258453761376586]
We propose Dream Engine, an efficient framework designed for arbitrary text-image interleaved control in image generation models. Our approach utilizes a two-stage training paradigm, consisting of joint text-image alignment and multimodal interleaved instruction tuning. Our experiments demonstrate that this training method is effective, achieving a 0.69 overall score on the GenEval benchmark.
arXiv Detail & Related papers (2025-02-27T15:08:39Z)
Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency [47.3163261953469]
Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities. We introduce ITIT: an innovative training paradigm grounded in the concept of cycle consistency which allows vision-language training on unpaired image and text data. ITIT is comprised of a joint image-text encoder with disjoint image and text decoders that enable bidirectional image-to-text and text-to-image generation in a single framework.
arXiv Detail & Related papers (2023-10-05T17:55:19Z)
TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training [5.239585892767183]
We propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training. Our proposed method outperforms existing state-of-the-art methods, both on text-guided generation tasks and manipulation tasks.
arXiv Detail & Related papers (2023-09-21T09:34:20Z)
Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis [37.32270579534541]
We propose a novel approach for enhancing text-image correspondence by leveraging available semantic layouts. Our approach achieves higher text-image correspondence compared to existing text-to-image generation approaches in the Multi-Modal CelebA-HQ and the Cityscapes dataset.
arXiv Detail & Related papers (2023-08-16T05:59:33Z)
Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling. We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content. We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences. To be more specific, both input texts and images are encoded into one unified multi-modal latent space. Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation. Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions. StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN. visual-linguistic similarity learns the text-image matching by mapping the image and text into a common embedding space. instance-level optimization is for identity preservation in manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
XGPT: Cross-modal Generative Pre-Training for Image Captioning [80.26456233277435]
XGPT is a new method of Cross-modal Generative Pre-Training for Image Captioning. It is designed to pre-train text-to-image caption generators through three novel generation tasks. XGPT can be fine-tuned without any task-specific architecture modifications.
arXiv Detail & Related papers (2020-03-03T12:13:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.