CoCa: Contrastive Captioners are Image-Text Foundation Models
- URL: http://arxiv.org/abs/2205.01917v1
- Date: Wed, 4 May 2022 07:01:14 GMT
- Title: CoCa: Contrastive Captioners are Image-Text Foundation Models
- Authors: Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba
Seyedhosseini, Yonghui Wu
- Abstract summary: Contrastive Captioner (CoCa) is a minimalist design to pretrain an image-text encoder-decoder foundation model.
By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead.
CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks.
- Score: 41.759438751996505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Exploring large-scale pretrained foundation models is of significant interest
in computer vision because these models can be quickly transferred to many
downstream tasks. This paper presents Contrastive Captioner (CoCa), a
minimalist design to pretrain an image-text encoder-decoder foundation model
jointly with contrastive loss and captioning loss, thereby subsuming model
capabilities from contrastive approaches like CLIP and generative methods like
SimVLM. In contrast to standard encoder-decoder transformers where all decoder
layers attend to encoder outputs, CoCa omits cross-attention in the first half
of decoder layers to encode unimodal text representations, and cascades the
remaining decoder layers which cross-attend to the image encoder for multimodal
image-text representations. We apply a contrastive loss between unimodal image
and text embeddings, in addition to a captioning loss on the multimodal decoder
outputs which predicts text tokens autoregressively. By sharing the same
computational graph, the two training objectives are computed efficiently with
minimal overhead. CoCa is pretrained end-to-end and from scratch on both
web-scale alt-text data and annotated images by treating all labels simply as
text, seamlessly unifying natural language supervision for representation
learning. Empirically, CoCa achieves state-of-the-art performance with
zero-shot transfer or minimal task-specific adaptation on a broad range of
downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700,
Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal
understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps).
Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1
accuracy, 90.6% with a frozen encoder and learned classification head, and new
state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.
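The architecture described above lends itself to a compact sketch. The PyTorch code below is a minimal illustration written from the abstract alone: the layer counts, embedding dimension, mean pooling of image tokens, use of the last text token as the sentence-level embedding, and the `CoCaSketch` name are assumptions for exposition, not details of the authors' implementation.

```python
# Minimal sketch of CoCa's split decoder and dual objectives (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoCaSketch(nn.Module):
    def __init__(self, vocab_size=32000, dim=512, n_unimodal=6, n_multimodal=6):
        super().__init__()
        # Stand-in image encoder: a linear map over precomputed patch features
        # (the paper uses a ViT; this keeps the sketch short).
        self.image_proj = nn.Linear(768, dim)
        self.token_emb = nn.Embedding(vocab_size, dim)
        # First half of the decoder: self-attention only, no cross-attention (unimodal text).
        self.unimodal = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True) for _ in range(n_unimodal)]
        )
        # Second half: cascaded layers that cross-attend to image tokens (multimodal image-text).
        self.multimodal = nn.ModuleList(
            [nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True) for _ in range(n_multimodal)]
        )
        self.to_logits = nn.Linear(dim, vocab_size)
        self.temperature = nn.Parameter(torch.tensor(0.07))  # learnable scale, CLIP-style assumption

    def forward(self, patch_feats, text_ids):
        # patch_feats: (B, N, 768) precomputed patch features; text_ids: (B, T) token ids.
        img_tokens = self.image_proj(patch_feats)                  # (B, N, D)
        img_embed = F.normalize(img_tokens.mean(dim=1), dim=-1)    # mean pooling is a simplification

        x = self.token_emb(text_ids)                               # (B, T, D)
        T = x.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        for layer in self.unimodal:                                # unimodal text branch
            x = layer(x, src_mask=causal)
        txt_embed = F.normalize(x[:, -1], dim=-1)                  # last token as text embedding

        y = x
        for layer in self.multimodal:                              # multimodal branch, cross-attends to image
            y = layer(y, img_tokens, tgt_mask=causal)
        logits = self.to_logits(y)                                 # (B, T, vocab)

        # Contrastive loss between unimodal image and text embeddings (symmetric InfoNCE).
        sim = img_embed @ txt_embed.t() / self.temperature
        labels = torch.arange(sim.size(0), device=sim.device)
        loss_con = 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))
        # Captioning loss: autoregressive next-token prediction on the multimodal outputs.
        loss_cap = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)), text_ids[:, 1:].reshape(-1)
        )
        return loss_con + loss_cap
```

A call such as `CoCaSketch()(torch.randn(2, 196, 768), torch.randint(0, 32000, (2, 16)))` returns the summed loss from a single forward pass, which mirrors how the two objectives share one computational graph with minimal overhead.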
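The zero-shot transfer mentioned above works, as in CLIP-style contrastive models, by comparing an image embedding against text embeddings of the candidate class names. The helper below is a hedged sketch of that procedure; the prompt template and the `encode_text` callable are hypothetical stand-ins, not the paper's actual prompting setup.

```python
# Sketch of zero-shot classification with the contrastive embeddings (assumptions noted above).
import torch

@torch.no_grad()
def zero_shot_classify(image_embed, class_names, encode_text):
    """image_embed: (D,) L2-normalized image embedding from the image encoder.
    encode_text: callable mapping a string to an L2-normalized (D,) text embedding
    from the unimodal text branch."""
    prompts = [f"a photo of a {name}" for name in class_names]    # prompt template is an assumption
    text_embeds = torch.stack([encode_text(p) for p in prompts])  # (C, D)
    sims = text_embeds @ image_embed                              # cosine similarities
    return class_names[int(sims.argmax())]
```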
Related papers
- DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [73.74291217502928]
We propose a simple framework, named DeCap, for zero-shot captioning.
We introduce a lightweight visual-aware language decoder.
We project the visual embedding into the CLIP text embedding space, while the projected embedding retains the information of the visual input.
arXiv Detail & Related papers (2023-03-06T11:02:47Z)
- Closed-Loop Transcription via Convolutional Sparse Coding [29.75613581643052]
Autoencoders often use generic deep networks as the encoder or decoder, which are difficult to interpret.
In this work, we make the explicit assumption that the image distribution is generated from a multistage convolutional sparse coding (CSC) model.
Our method enjoys several side benefits, including more structured and interpretable representations, more stable convergence, and scalability to large datasets.
arXiv Detail & Related papers (2023-02-18T14:40:07Z)
- CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification without Concrete Text Labels [28.42405456691034]
We propose a two-stage strategy to facilitate a better visual representation in image re-identification tasks.
The key idea is to fully exploit the cross-modal description ability in CLIP through a set of learnable text tokens for each ID.
The effectiveness of the proposed strategy is validated on several person and vehicle ReID datasets.
arXiv Detail & Related papers (2022-11-25T09:41:57Z)
- On the Importance of Image Encoding in Automated Chest X-Ray Report Generation [4.843654097048771]
Chest X-ray is one of the most popular medical imaging modalities due to its accessibility and effectiveness.
There is a chronic shortage of well-trained radiologists who can interpret these images and diagnose patients' conditions.
Automated radiology report generation can therefore be a very helpful tool in clinical practice.
arXiv Detail & Related papers (2022-11-24T08:02:52Z)
- Exploring Discrete Diffusion Models for Image Captioning [104.69608826164216]
We present a diffusion-based captioning model, dubbed DDCap, to allow more decoding flexibility.
We propose several key techniques including best-first inference, concentrated attention mask, text length prediction, and image-free training.
With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO.
arXiv Detail & Related papers (2022-11-21T18:12:53Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
- End-to-End Transformer Based Model for Image Captioning [1.4303104706989949]
The Transformer-based model integrates image captioning into a single stage and enables end-to-end training.
The model achieves new state-of-the-art performance of 138.2% (single model) and 141.0% (ensemble of 4 models).
arXiv Detail & Related papers (2022-03-29T08:47:46Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z)