DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only
Training
- URL: http://arxiv.org/abs/2303.03032v1
- Date: Mon, 6 Mar 2023 11:02:47 GMT
- Title: DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only
Training
- Authors: Wei Li, Linchao Zhu, Longyin Wen, Yi Yang
- Abstract summary: We propose a simple framework, named DeCap, for zero-shot captioning.
We introduce a lightweight visual-aware language decoder.
We project the visual embedding into the CLIP text embedding space, while the projected embedding retains the information of the visual input.
- Score: 73.74291217502928
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong
zero-shot transfer capability in many discriminative tasks. Their adaptation to
zero-shot image-conditioned text generation tasks has drawn increasing
interest. Prior works approach zero-shot captioning by either utilizing existing
large language models (e.g., GPT-2) or pre-training an encoder-decoder network
end-to-end. In this work, we propose a
simple framework, named DeCap, for zero-shot captioning. We introduce a
lightweight visual-aware language decoder. This decoder is both data-efficient
and computation-efficient: 1) it requires only text data for training, easing
the burden of collecting paired data; 2) it does not require
end-to-end training. When trained with text-only data, the decoder takes the
text embedding extracted from the off-the-shelf CLIP encoder as a prefix
embedding. The challenge is that the decoder is trained on the text corpus but
at the inference stage, it needs to generate captions based on visual inputs.
The modality gap, widely observed in multi-modal contrastive models, prevents us
from directly taking the visual embedding as the prefix embedding. We propose a
training-free mechanism to reduce the modality gap. We
project the visual embedding into the CLIP text embedding space, while the
projected embedding retains the information of the visual input. Taking the
projected embedding as the prefix embedding, the decoder generates high-quality
descriptions that match the visual input. The experiments show that DeCap
outperforms other zero-shot captioning methods and unpaired captioning methods
on the typical image captioning benchmarks, i.e., MSCOCO and NoCaps.
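As a concrete illustration of the projection step described in the abstract, the following is a minimal PyTorch sketch. It assumes one plausible training-free realization: a softmax-weighted combination over a memory of CLIP text embeddings. The specific weighting scheme, function names, and dimensions are illustrative assumptions; only the general idea of projecting the image embedding into the CLIP text embedding space and feeding the result to the decoder as a prefix comes from the abstract.
```python
# Hypothetical sketch of a training-free projection from the CLIP image
# embedding space into the CLIP text embedding space. The support-memory
# weighting is an illustrative assumption, not a detail stated in the abstract.
import torch
import torch.nn.functional as F

def project_to_text_space(image_emb: torch.Tensor,
                          text_memory: torch.Tensor,
                          temperature: float = 0.01) -> torch.Tensor:
    """Project a CLIP image embedding onto a memory of CLIP text embeddings.

    image_emb:   (d,)   CLIP image embedding of the test image
    text_memory: (N, d) CLIP text embeddings of the text-only training corpus
    returns:     (d,)   projected embedding, usable as the decoder prefix
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_memory = F.normalize(text_memory, dim=-1)

    # Cosine similarity between the image and every memory entry,
    # converted to weights with a low-temperature softmax.
    sims = text_memory @ image_emb                 # (N,)
    weights = F.softmax(sims / temperature, dim=-1)

    # Weighted average of text embeddings: the result lies in the text
    # embedding space but is steered by the visual input.
    projected = weights @ text_memory              # (d,)
    return F.normalize(projected, dim=-1)

# Toy usage with random stand-ins for CLIP embeddings (d = 512).
if __name__ == "__main__":
    torch.manual_seed(0)
    memory = torch.randn(1000, 512)   # embeddings of the training captions
    image = torch.randn(512)          # embedding of a test image
    prefix = project_to_text_space(image, memory)
    print(prefix.shape)               # torch.Size([512])
```
Because the projected embedding is a convex combination of text embeddings, it stays on the text side of the modality gap, which is why a decoder trained only on text-embedding prefixes can consume it at inference time.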
Related papers
- DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning [13.601154787754046]
DRCap is a data-efficient and flexible zero-shot audio captioning system.
It requires text-only data for training and can quickly adapt to new domains without additional fine-tuning.
arXiv Detail & Related papers (2024-10-12T10:21:00Z) - Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only Text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z) - Retrieval Enhanced Zero-Shot Video Captioning [69.96136689829778]
We bridge video and text using three key models: a general video understanding model XCLIP, a general image understanding model CLIP, and a text generation model GPT-2.
To connect these frozen models, we propose using learnable tokens as a communication medium between frozen GPT-2 and frozen XCLIP.
Experiments show 4% to 20% improvements in terms of the main metric CIDEr compared to the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-05-11T16:22:00Z) - LocCa: Visual Pretraining with Location-aware Captioners [53.93041020615046]
We propose a simple visual pretraining method with location-aware captioners (LocCa).
LocCa uses a simple image captioner task interface to teach a model to read out rich information.
Thanks to the multitask capabilities of an encoder-decoder architecture, we show that an image captioner can easily handle multiple tasks during pretraining.
arXiv Detail & Related papers (2024-03-28T17:20:39Z) - MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual
Captioning [108.12011636732674]
MultiCapCLIP can generate visual captions for different scenarios and languages without any labeled vision-caption pairs of downstream datasets.
Our method achieves 4.8% and 21.5% absolute improvements in terms of BLEU@4 and CIDEr metrics.
arXiv Detail & Related papers (2023-08-25T07:32:34Z) - Retrieval-augmented Image Captioning [15.266569206458648]
We present a new approach to image captioning that generates sentences given the input image and a set of captions retrieved from a datastore.
The encoder in our model jointly processes the image and retrieved captions using a pretrained V&L BERT.
Our work contributes towards using pretrained V&L encoders for generative tasks, instead of standard classification tasks.
arXiv Detail & Related papers (2023-02-16T12:54:13Z) - CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification
without Concrete Text Labels [28.42405456691034]
We propose a two-stage strategy to facilitate a better visual representation in image re-identification tasks.
The key idea is to fully exploit the cross-modal description ability in CLIP through a set of learnable text tokens for each ID.
The effectiveness of the proposed strategy is validated on several datasets for the person or vehicle ReID tasks.
arXiv Detail & Related papers (2022-11-25T09:41:57Z) - Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on large-scale image-text pairs from the web, to calculate multimodal similarity and use it as a reward function.
We also propose a simple fine-tuning strategy for the CLIP text encoder that improves grammar without requiring extra text annotation.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
arXiv Detail & Related papers (2022-05-26T02:46:09Z) - CoCa: Contrastive Captioners are Image-Text Foundation Models [41.759438751996505]
Contrastive Captioner (CoCa) is a minimalist design to pretrain an image-text encoder-decoder foundation model.
By sharing the same computational graph, the contrastive and captioning training objectives are computed efficiently with minimal overhead.
CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks.
arXiv Detail & Related papers (2022-05-04T07:01:14Z) - ClipCap: CLIP Prefix for Image Captioning [6.69087470775851]
We use the CLIP encoding as a prefix to the caption, employing a simple mapping network, and then fine-tune a language model to generate the image captions.
We demonstrate our model achieves comparable results to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets.
arXiv Detail & Related papers (2021-11-18T14:49:15Z)