L-Verse: Bidirectional Generation Between Image and Text
- URL: http://arxiv.org/abs/2111.11133v3
- Date: Wed, 24 Nov 2021 02:58:34 GMT
- Title: L-Verse: Bidirectional Generation Between Image and Text
- Authors: Taehoon Kim, Gwangmo Song, Sihaeng Lee, Sangyun Kim, Yewon Seo,
Soonyoung Lee, Seung Hwan Kim, Honglak Lee, Kyunghoon Bae
- Abstract summary: L-Verse is a novel architecture consisting of a feature-augmented variational autoencoder (AugVAE) and a bidirectional auto-regressive transformer (BiART).
Our AugVAE achieves state-of-the-art reconstruction performance on the ImageNet1K validation set, along with robustness to unseen images in the wild.
L-Verse can be directly used for image-to-text or text-to-image generation tasks without any finetuning or extra object detection frameworks.
- Score: 41.133824156046394
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Far beyond learning long-range interactions of natural language, transformers
are becoming the de facto standard for many vision tasks with their power and scalability. Especially for cross-modal tasks between image and text, vector
quantized variational autoencoders (VQ-VAEs) are widely used to make a raw RGB
image into a sequence of feature vectors. To better leverage the correlation
between image and text, we propose L-Verse, a novel architecture consisting of
feature-augmented variational autoencoder (AugVAE) and bidirectional
auto-regressive transformer (BiART) for text-to-image and image-to-text
generation. Our AugVAE shows state-of-the-art reconstruction performance on the ImageNet1K validation set, along with robustness to unseen images in the wild. Unlike other models, BiART can distinguish between image (or text) as a
conditional reference and a generation target. L-Verse can be directly used for
image-to-text or text-to-image generation tasks without any finetuning or extra
object detection frameworks. In quantitative and qualitative experiments,
L-Verse shows impressive results against previous methods in both image-to-text
and text-to-image generation on MS-COCO Captions. We furthermore assess the
scalability of L-Verse architecture on Conceptual Captions and present the
initial results of bidirectional vision-language representation learning in the general domain. Code is available at: https://github.com/tgisaturday/L-Verse
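As a rough illustration of the mechanism summarized above, the sketch below shows how a single decoder-only transformer could condition on one modality while autoregressively generating the other, using a learned segment embedding that marks each token as conditional reference or generation target. This is a minimal PyTorch sketch under assumed vocabulary sizes and module names (BidirectionalARTransformer, ref_tokens, tgt_tokens are illustrative), not the released L-Verse implementation linked above.

```python
import torch
import torch.nn as nn


class BidirectionalARTransformer(nn.Module):
    """Decoder-only transformer over a shared text+image token vocabulary.

    A learned segment embedding marks each token as either the conditional
    reference (0) or the generation target (1), so the same weights can serve
    both text-to-image and image-to-text generation.
    """

    def __init__(self, text_vocab=16384, image_vocab=8192, d_model=512,
                 n_heads=8, n_layers=6, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(text_vocab + image_vocab, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.seg_emb = nn.Embedding(2, d_model)  # 0 = reference, 1 = target
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, text_vocab + image_vocab)

    def forward(self, ref_tokens, tgt_tokens):
        # Concatenate reference and target sequences into one token stream.
        x = torch.cat([ref_tokens, tgt_tokens], dim=1)                # (B, L)
        seg = torch.cat([torch.zeros_like(ref_tokens),
                         torch.ones_like(tgt_tokens)], dim=1)         # (B, L)
        pos = torch.arange(x.size(1), device=x.device)
        h = self.tok_emb(x) + self.pos_emb(pos) + self.seg_emb(seg)
        # Causal mask: each position attends only to earlier positions.
        causal = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf"),
                                       device=x.device), diagonal=1)
        h = self.blocks(h, mask=causal)
        return self.head(h)                                           # logits


# Text-to-image direction: text tokens are the reference, image tokens
# (e.g. indices from a VQ-style autoencoder such as AugVAE) are the target.
model = BidirectionalARTransformer()
text = torch.randint(0, 16384, (2, 32))              # dummy text token ids
image = torch.randint(16384, 16384 + 8192, (2, 64))  # dummy image token ids
logits = model(ref_tokens=text, tgt_tokens=image)
print(logits.shape)  # torch.Size([2, 96, 24576])
```

Swapping the two arguments conditions on image tokens and generates text instead, which mirrors the bidirectional use described in the abstract; consult the repository above for the actual architecture.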
Related papers
- Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts.
Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities.
We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
arXiv Detail & Related papers (2023-08-08T15:43:59Z) - CoBIT: A Contrastive Bi-directional Image-Text Generation Model [72.1700346308106]
CoBIT employs a novel unicoder-decoder structure, which attempts to unify three pre-training objectives in one framework.
CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios.
arXiv Detail & Related papers (2023-03-23T17:24:31Z) - Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z) - GIT: A Generative Image-to-text Transformer for Vision and Language [138.91581326369837]
We train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.
Our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr).
arXiv Detail & Related papers (2022-05-27T17:03:38Z) - Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z) - ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [22.47279425592133]
We propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation.
For the text-to-image generation process, we propose an end-to-end training method to jointly learn the visual sequence generator and the image reconstructor.
We train a 10-billion parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs.
arXiv Detail & Related papers (2021-12-31T03:53:33Z) - Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text-to-image generation (T2I) model aims to generate photo-realistic images that are semantically consistent with the text descriptions.
We propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)