Improved baselines for vision-language pre-training
- URL: http://arxiv.org/abs/2305.08675v2
- Date: Sat, 4 Nov 2023 17:55:12 GMT
- Title: Improved baselines for vision-language pre-training
- Authors: Enrico Fini and Pietro Astolfi and Adriana Romero-Soriano and Jakob
Verbeek and Michal Drozdzal
- Abstract summary: We propose, implement and evaluate several baselines obtained by combining contrastive learning with self-supervised learning.
We find that these baselines outperform a basic implementation of CLIP.
We find that a simple CLIP baseline can also be improved substantially, up to a 25% relative improvement on downstream zero-shot tasks.
- Score: 26.395527650984025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive learning has emerged as an efficient framework to learn
multimodal representations. CLIP, a seminal work in this area, achieved
impressive results by training on paired image-text data using the contrastive
loss. Recent work claims improvements over CLIP using additional
non-contrastive losses inspired from self-supervised learning. However, it is
sometimes hard to disentangle the contribution of these additional losses from
other implementation details, e.g., data augmentation or regularization
techniques, used to train the model. To shed light on this matter, in this
paper, we first propose, implement and evaluate several baselines obtained by
combining contrastive learning with recent advances in self-supervised
learning. In particular, we use the loss functions that were proven successful
for visual self-supervised learning to align image and text modalities. We find
that these baselines outperform a basic implementation of CLIP. However, when a
stronger training recipe is employed, the advantage disappears. Indeed, we find
that a simple CLIP baseline can also be improved substantially, up to a 25%
relative improvement on downstream zero-shot tasks, by using well-known
training techniques that are popular in other subfields. Moreover, we discover
that it is enough to apply image and text augmentations to make up for most of
the improvement attained by prior works. With our improved training recipe for
CLIP, we obtain state-of-the-art performance on four standard datasets, and
consistently outperform prior work (up to +4% on the largest dataset), while
being substantially simpler. The code is available at
https://github.com/facebookresearch/clip-rocket
Related papers
- Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning [78.19528555505961]
We propose a novel vision model pre-training method called Latent Compression Learning (LCL) for interleaved image-text data.
The training objective can be decomposed into two basic tasks: 1) contrastive learning between visual representation and preceding context, and 2) generating subsequent text based on visual representation.
Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets, but can also leverage interleaved pre-training data.
arXiv Detail & Related papers (2024-06-11T17:59:35Z) - A Simple-but-effective Baseline for Training-free Class-Agnostic
Counting [30.792198686654075]
Class-Agnostic Counting (CAC) seeks to accurately count objects in a given image with only a few reference examples.
Recent efforts have shown that it's possible to accomplish this without training by utilizing pre-existing foundation models.
We present a training-free solution that effectively bridges this performance gap, serving as a strong baseline.
arXiv Detail & Related papers (2024-03-03T07:19:50Z) - A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation [121.0693322732454]
Contrastive Language-Image Pretraining (CLIP) has gained popularity for its remarkable zero-shot capacity.
Recent research has focused on developing efficient fine-tuning methods to enhance CLIP's performance in downstream tasks.
We revisit a classical algorithm, Gaussian Discriminant Analysis (GDA), and apply it to the downstream classification of CLIP.
arXiv Detail & Related papers (2024-02-06T15:45:27Z) - Continual Contrastive Spoken Language Understanding [33.09005399967931]
COCONUT is a class-incremental learning (CIL) method that relies on the combination of experience replay and contrastive learning.
We show that COCONUT can be combined with methods that operate on the decoder side of the model, resulting in further metrics improvements.
arXiv Detail & Related papers (2023-10-04T10:09:12Z) - Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z) - Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models.
Our method allows for effortless integration with existing models' training pipelines.
On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
arXiv Detail & Related papers (2023-05-09T07:00:17Z) - EfficientTrain: Exploring Generalized Curriculum Learning for Training
Visual Backbones [80.662250618795]
This paper presents a new curriculum learning approach for the efficient training of visual backbones (e.g., vision Transformers)
As an off-the-shelf method, it reduces the wall-time training cost of a wide variety of popular models by >1.5x on ImageNet-1K/22K without sacrificing accuracy.
arXiv Detail & Related papers (2022-11-17T17:38:55Z) - Pretext-Contrastive Learning: Toward Good Practices in Self-supervised
Video Representation Leaning [43.002621928500425]
We propose a joint optimization framework to boost both pretext task and contrastive learning.
It is convenient to treat PCL as a standard training strategy and apply it to many other works in self-supervised video feature learning.
arXiv Detail & Related papers (2020-10-29T10:20:35Z) - Rethinking Few-Shot Image Classification: a Good Embedding Is All You
Need? [72.00712736992618]
We show that a simple baseline: learning a supervised or self-supervised representation on the meta-training set, outperforms state-of-the-art few-shot learning methods.
An additional boost can be achieved through the use of self-distillation.
We believe that our findings motivate a rethinking of few-shot image classification benchmarks and the associated role of meta-learning algorithms.
arXiv Detail & Related papers (2020-03-25T17:58:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.