Demonstrating and Reducing Shortcuts in Vision-Language Representation
Learning
- URL: http://arxiv.org/abs/2402.17510v1
- Date: Tue, 27 Feb 2024 13:50:34 GMT
- Authors: Maurits Bleeker, Mariya Hendriksen, Andrew Yates, Maarten de Rijke
- Abstract summary: We introduce synthetic shortcuts for vision-language: a training and evaluation framework where we inject synthetic shortcuts into image-text data.
We show that contrastive VLMs trained from scratch or fine-tuned with data containing these synthetic shortcuts mainly learn features that represent the shortcut.
- Score: 62.80302738628635
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) mainly rely on contrastive training to learn
general-purpose representations of images and captions. We focus on the
situation when one image is associated with several captions, each caption
containing both information shared among all captions and unique information
per caption about the scene depicted in the image. In such cases, it is unclear
whether contrastive losses are sufficient for learning task-optimal
representations that contain all the information provided by the captions or
whether the contrastive learning setup encourages the learning of a simple
shortcut that minimizes contrastive loss. We introduce synthetic shortcuts for
vision-language: a training and evaluation framework where we inject synthetic
shortcuts into image-text data. We show that contrastive VLMs trained from
scratch or fine-tuned with data containing these synthetic shortcuts mainly
learn features that represent the shortcut. Hence, contrastive losses are not
sufficient to learn task-optimal representations, i.e., representations that
contain all task-relevant information shared between the image and associated
captions. We examine two methods to reduce shortcut learning in our training
and evaluation framework: (i) latent target decoding and (ii) implicit feature
modification. We show empirically that both methods improve performance on the
evaluation task, but only partly reduce shortcut learning when training and
evaluating with our shortcut learning framework. Hence, we show the difficulty
and challenge of our shortcut learning framework for contrastive
vision-language representation learning.
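The claim that a shortcut can minimize a contrastive loss can be illustrated with a toy sketch. The abstract does not specify the loss or the form of the shortcuts, so everything here is an assumption: the loss is taken to be a symmetric InfoNCE loss, and the shortcut is modeled as a hypothetical one-hot pair identifier. This is not the authors' implementation; it only shows that embeddings carrying nothing but such an identifier already drive the loss close to zero.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(image_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE loss (assumed here); pair i is the positive for row i."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarities between every image and every caption, scaled by temperature.
    logits = [[dot(imgs[i], txts[j]) / temperature for j in range(n)]
              for i in range(n)]

    def cross_entropy(rows):
        # Mean negative log-probability of the matching (diagonal) entry.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    text_to_image = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(text_to_image))


# Hypothetical shortcut: each image-caption pair is encoded ONLY by a
# one-hot identifier, with no information about the depicted scene.
shortcut_only = [[1.0 if j == i else 0.0 for j in range(4)] for i in range(4)]

# Baseline: embeddings that cannot tell the pairs apart at all.
constant = [[1.0, 0.0, 0.0, 0.0] for _ in range(4)]

shortcut_loss = info_nce(shortcut_only, shortcut_only)
constant_loss = info_nce(constant, constant)
print(shortcut_loss, constant_loss)
```

In this sketch the identifier-only embeddings reach a loss near zero, while the indistinguishable embeddings sit at log(4), the chance level for a batch of four pairs. The loss therefore gives no incentive to encode anything beyond the identifier.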
Related papers
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for
Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
(2022-11-28T14:58:15Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
(2022-10-17T17:57:46Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene
Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by 2.5% and 4.8% when its weights are transferred to other text detection and spotting networks.
(2022-03-08T08:10:45Z)
- Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features.
(2021-12-08T18:58:16Z)
- VL-LTR: Learning Class-wise Visual-Linguistic Representation for
Long-Tailed Visual Recognition [61.75391989107558]
We present a visual-linguistic long-tailed recognition framework, termed VL-LTR.
Our method can learn visual representation from images and corresponding linguistic representation from noisy class-level text descriptions.
Notably, our method achieves 77.2% overall accuracy on ImageNet-LT, which significantly outperforms the previous best method by over 17 points.
(2021-11-26T16:24:03Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy
Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
(2021-02-11T10:08:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.