CoBIT: A Contrastive Bi-directional Image-Text Generation Model
- URL: http://arxiv.org/abs/2303.13455v1
- Date: Thu, 23 Mar 2023 17:24:31 GMT
- Title: CoBIT: A Contrastive Bi-directional Image-Text Generation Model
- Authors: Haoxuan You, Mandy Guo, Zhecan Wang, Kai-Wei Chang, Jason Baldridge,
Jiahui Yu
- Abstract summary: CoBIT employs a novel unicoder-decoder structure and attempts to unify three pre-training objectives in one framework.
CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios.
- Score: 72.1700346308106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The field of vision and language has witnessed a proliferation of pre-trained
foundation models. Most existing methods are independently pre-trained with a
contrastive objective (like CLIP), an image-to-text generative objective (like
PaLI), or a text-to-image generative objective (like Parti). However, the three
objectives can be pre-trained on the same data, image-text pairs, and they
intuitively complement each other: contrasting provides global alignment
capacity, while generation grants fine-grained understanding. In this work, we present a
Contrastive Bi-directional Image-Text generation model (CoBIT), which attempts
to unify the three pre-training objectives in one framework. Specifically,
CoBIT employs a novel unicoder-decoder structure, consisting of an image
unicoder, a text unicoder and a cross-modal decoder. The image/text unicoders
can switch between encoding and decoding in different tasks, enabling
flexibility and shared knowledge that benefits both image-to-text and
text-to-image generations. CoBIT achieves superior performance in image
understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE)
and text-based content creation, particularly in zero-shot scenarios. For
instance, it attains 82.7% accuracy in zero-shot ImageNet classification, a 9.37
FID score in zero-shot text-to-image generation, and 44.8 CIDEr in zero-shot image captioning.
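The abstract describes three objectives (contrastive, image-to-text, text-to-image) trained on the same image-text pairs, with unicoders that can switch between encoding and decoding. The sketch below is a minimal, illustrative combination of such losses in PyTorch; the `image_unicoder`, `text_unicoder`, and `cross_decoder` callables, their `mode`/`modality` arguments, the mean pooling, and the loss weights are hypothetical stand-ins, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def cobit_style_loss(image_unicoder, text_unicoder, cross_decoder,
                     images, text_tokens, image_tokens,
                     temperature=0.07, w_con=1.0, w_i2t=1.0, w_t2i=1.0):
    """Combine a contrastive, an image-to-text, and a text-to-image loss
    on the same batch of image-text pairs (illustrative only)."""
    # Hypothetical unicoders run in *encoding* mode for all three objectives;
    # assume they return sequence features of shape [batch, length, dim].
    img_feats = image_unicoder(images, mode="encode")
    txt_feats = text_unicoder(text_tokens, mode="encode")

    # 1) Contrastive objective (CLIP-style InfoNCE on pooled embeddings).
    img_emb = F.normalize(img_feats.mean(dim=1), dim=-1)
    txt_emb = F.normalize(txt_feats.mean(dim=1), dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_con = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # 2) Image-to-text generation: the cross-modal decoder predicts text
    #    tokens conditioned on image features (teacher-forced next-token loss).
    text_logits = cross_decoder(cond=img_feats,
                                targets=text_tokens[:, :-1], modality="text")
    loss_i2t = F.cross_entropy(text_logits.flatten(0, 1),
                               text_tokens[:, 1:].flatten())

    # 3) Text-to-image generation: the decoder predicts discrete image tokens
    #    (e.g. from a VQ tokenizer) conditioned on text features.
    image_logits = cross_decoder(cond=txt_feats,
                                 targets=image_tokens[:, :-1], modality="image")
    loss_t2i = F.cross_entropy(image_logits.flatten(0, 1),
                               image_tokens[:, 1:].flatten())

    return w_con * loss_con + w_i2t * loss_i2t + w_t2i * loss_t2i
```

Sharing the same encoded features across the contrastive and generative branches is what lets the global-alignment and fine-grained objectives reinforce each other, which is the intuition the abstract emphasizes.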
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR)
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE)
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z)
- Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency [47.3163261953469]
Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities.
We introduce ITIT, an innovative training paradigm grounded in cycle consistency that allows vision-language training on unpaired image and text data (see the sketch after this list).
ITIT comprises a joint image-text encoder with disjoint image and text decoders, enabling bidirectional image-to-text and text-to-image generation in a single framework.
arXiv Detail & Related papers (2023-10-05T17:55:19Z)
- TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training [5.239585892767183]
We propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training.
Our proposed method outperforms existing state-of-the-art methods, both on text-guided generation tasks and manipulation tasks.
arXiv Detail & Related papers (2023-09-21T09:34:20Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% when transferring its weights to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- L-Verse: Bidirectional Generation Between Image and Text [41.133824156046394]
L-Verse is a novel architecture consisting of feature-augmented variational autoencoder (AugVAE) and bidirectional auto-regressive transformer (BiART)
Our AugVAE shows state-of-the-art reconstruction performance on the ImageNet1K validation set, along with robustness to unseen images in the wild.
L-Verse can be directly used for image-to-text or text-to-image generation tasks without any finetuning or extra object detection frameworks.
arXiv Detail & Related papers (2021-11-22T11:48:26Z)
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler yet more efficient at synthesizing realistic and text-matching images.
arXiv Detail & Related papers (2020-08-13T12:51:17Z)
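As a companion to the ITIT entry above, here is a minimal, hypothetical sketch of the cycle-consistency idea it describes: generation crosses modalities and the round trip is supervised against the original unpaired input. The `t2i` and `i2t` callables, the greedy decoding, and the pixel-space loss are illustrative assumptions, not ITIT's actual training recipe.

```python
import torch.nn.functional as F

def text_cycle_loss(t2i, i2t, text_tokens):
    # Text -> generated image -> reconstructed text; supervise the round trip
    # with the original (unpaired) text tokens.
    generated_image = t2i(text_tokens)            # e.g. [B, C, H, W]
    recon_logits = i2t(generated_image)           # [B, T, vocab]
    return F.cross_entropy(recon_logits.flatten(0, 1), text_tokens.flatten())

def image_cycle_loss(i2t, t2i, images):
    # Image -> generated caption -> reconstructed image; compare the
    # reconstruction to the original (unpaired) image in pixel space.
    caption_tokens = i2t(images).argmax(dim=-1)   # greedy decode as a stand-in for sampling
    reconstructed = t2i(caption_tokens)
    return F.mse_loss(reconstructed, images)
```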
This list is automatically generated from the titles and abstracts of the papers on this site.