Linear Alignment of Vision-language Models for Image Captioning
- URL: http://arxiv.org/abs/2307.05591v3
- Date: Tue, 6 Feb 2024 09:33:48 GMT
- Title: Linear Alignment of Vision-language Models for Image Captioning
- Authors: Fabian Paischer, Markus Hofmarcher, Sepp Hochreiter, Thomas Adler
- Abstract summary: We propose a more efficient training protocol that fits a linear mapping between image and text embeddings of CLIP.
This bypasses the need for gradient computation and results in a lightweight captioning method called ReCap.
We evaluate ReCap on MS-COCO, Flickr30k, VizWiz, and MSRVTT.
- Score: 9.746397419479447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, vision-language models like CLIP have advanced the state of the art
in a variety of multi-modal tasks including image captioning and caption
evaluation. Many approaches adapt CLIP-style models to a downstream task by
training a mapping network between CLIP and a language model. This is costly as
it usually involves calculating gradients for large models. We propose a more
efficient training protocol that fits a linear mapping between image and text
embeddings of CLIP via a closed-form solution. This bypasses the need for
gradient computation and results in a lightweight captioning method called
ReCap, which can be trained up to 1000 times faster than existing lightweight
methods. Moreover, we propose two new learning-based image-captioning metrics
that build on CLIP score along with our linear mapping. Furthermore, we combine
ReCap with our new metrics to design an iterative datastore-augmentation loop
(DAL) based on synthetic captions. We evaluate ReCap on MS-COCO, Flickr30k,
VizWiz, and MSRVTT. ReCap achieves performance comparable to state-of-the-art
lightweight methods on established metrics while outperforming them on our new
metrics, which are better aligned with human ratings on Flickr8k-Expert and
Flickr8k-Crowdflower. Finally, we demonstrate that ReCap transfers well to
other domains and that our DAL leads to a performance boost.
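To make the closed-form training step concrete, below is a minimal sketch that fits a linear map between paired CLIP image and text embeddings via ridge regression; the embedding dimension, the regularization term, and the use of an unconstrained mapping are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Hedged sketch: fit a linear map W from CLIP image embeddings to CLIP text
# embeddings in closed form (no gradient computation). Shapes, the ridge term,
# and the retrieval step are assumptions for illustration only.

def fit_linear_map(image_embs: np.ndarray, text_embs: np.ndarray, lam: float = 1e-3) -> np.ndarray:
    """Solve W = argmin_W ||X W - Y||^2 + lam ||W||^2 via the normal equations."""
    X, Y = image_embs, text_embs                      # X: (n, d), Y: (n, d)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)   # (d, d)

# Usage: project a new image embedding into the text embedding space; nearby
# captions in a datastore (by cosine similarity) can then condition generation.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512)); X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = rng.normal(size=(1000, 512)); Y /= np.linalg.norm(Y, axis=1, keepdims=True)
W = fit_linear_map(X, Y)
query = X[:1] @ W                                     # projected image embedding, (1, 512)
```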
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
- VeCLIP: Improving CLIP Training via Visual-enriched Captions [63.547204530720705]
This study introduces a scalable pipeline for noisy caption rewriting.
We emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap).
We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP.
arXiv Detail & Related papers (2023-10-11T17:49:13Z)
- Improving CLIP Training with Language Rewrites [57.935517901210225]
We introduce Language augmented CLIP (LaCLIP) to enhance CLIP training through language rewrites.
We show that LaCLIP significantly improves the transfer performance without computation or memory overhead during training.
Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M.
arXiv Detail & Related papers (2023-05-31T17:59:04Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of their powerful generalization ability.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on huge amounts of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function.
We also propose a simple fine-tuning strategy for the CLIP text encoder that improves grammar without requiring extra text annotations.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model (see the sketch after this list).
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
- Robust Cross-Modal Representation Learning with Progressive Self-Distillation [7.676408770854477]
The learning objective of CLIP's vision-language approach does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets.
We introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data.
arXiv Detail & Related papers (2022-04-10T03:28:18Z)
- ClipCap: CLIP Prefix for Image Captioning [6.69087470775851]
We use the CLIP encoding as a prefix to the caption by employing a simple mapping network, and then fine-tune a language model to generate the image captions.
We demonstrate our model achieves comparable results to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets.
arXiv Detail & Related papers (2021-11-18T14:49:15Z)
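The CLIP-as-reward idea from "Fine-grained Image Captioning with CLIP Reward" above can be illustrated with a small sketch: the cosine similarity between an image embedding and candidate caption embeddings serves as a per-caption reward. The precomputed embeddings and the omission of that paper's reward shaping and RL objective are assumptions for illustration only.

```python
import numpy as np

# Hedged sketch: CLIP-style similarity as a per-caption reward signal.
# Assumes image/caption embeddings were precomputed with a CLIP encoder.

def clip_reward(image_emb: np.ndarray, caption_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one image embedding and a batch of caption embeddings."""
    img = image_emb / np.linalg.norm(image_emb)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    return caps @ img                      # shape (num_captions,), higher = better match

# Usage: score sampled captions and feed the rewards to a policy-gradient update
# (e.g., self-critical sequence training) in place of a CIDEr reward.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
caption_embs = rng.normal(size=(5, 512))
rewards = clip_reward(image_emb, caption_embs)
```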