Self-Supervision on Images and Text Reduces Reliance on Visual Shortcut Features
- URL: http://arxiv.org/abs/2206.07155v1
- Date: Tue, 14 Jun 2022 20:33:26 GMT
- Title: Self-Supervision on Images and Text Reduces Reliance on Visual Shortcut Features
- Authors: Anil Palepu, Andrew L Beam
- Abstract summary: Shortcut features are inputs that are associated with the outcome of interest in the training data, but are either no longer associated or not present in testing or deployment settings.
We show that self-supervised models trained on images and text provide more robust image representations and reduce the model's reliance on visual shortcut features.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning models trained in a fully supervised manner have been shown to
rely on so-called "shortcut" features. Shortcut features are inputs that are
associated with the outcome of interest in the training data, but are either no
longer associated or not present in testing or deployment settings. Here we
provide experiments that show recent self-supervised models trained on images
and text provide more robust image representations and reduce the model's
reliance on visual shortcut features on a realistic medical imaging example.
Additionally, we find that these self-supervised models "forget" shortcut
features more quickly than fully supervised ones when fine-tuned on labeled
data. Though not a complete solution, our experiments provide compelling
evidence that self-supervised models trained on images and text provide some
resilience to visual shortcut features.
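
As a purely illustrative companion to the abstract, the sketch below probes shortcut reliance by injecting a synthetic corner-watermark shortcut that is correlated with the label at training time but randomized at test time, then fitting a linear probe on features from a fully supervised backbone versus an image-text self-supervised (CLIP-style) encoder. The watermark shortcut, the hypothetical load_dataset() loader, and the torchvision/open_clip encoder choices are assumptions made for illustration; the paper's own experiments use a realistic medical imaging example rather than this toy setup.

```python
# Minimal, illustrative sketch of a shortcut-reliance probe in the spirit of the abstract.
# Assumptions (not from the paper): a synthetic corner-watermark shortcut, a hypothetical
# load_dataset() yielding (PIL image, binary label) pairs, torchvision's supervised
# ResNet-50 as the fully supervised baseline, and open_clip's ViT-B/32 as the
# image-text self-supervised encoder.
import numpy as np
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression


def add_watermark(img: Image.Image, on: bool) -> Image.Image:
    """Paste a small bright square in the corner -- the synthetic 'shortcut' feature."""
    img = img.copy()
    if on:
        img.paste(Image.new(img.mode, (16, 16), "white"), (0, 0))
    return img


def inject_shortcut(pairs, correlate: bool, seed: int = 0):
    """Correlate the watermark with the label (training) or randomize it (testing)."""
    rng = np.random.default_rng(seed)
    return [
        (add_watermark(img, bool(y) if correlate else bool(rng.integers(2))), y)
        for img, y in pairs
    ]


@torch.no_grad()
def encode(images, model, preprocess):
    """Stack per-image feature vectors into an (N, D) array."""
    feats = [model(preprocess(im).unsqueeze(0)).squeeze(0) for im in images]
    return torch.stack(feats).float().numpy()


def supervised_encoder():
    """Fully supervised ImageNet features (ResNet-50 penultimate layer)."""
    import torchvision
    weights = torchvision.models.ResNet50_Weights.IMAGENET1K_V2
    model = torchvision.models.resnet50(weights=weights).eval()
    model.fc = torch.nn.Identity()  # expose the 2048-d pooled features
    preprocess = weights.transforms()
    return lambda ims: encode(ims, model, preprocess)


def clip_encoder():
    """Image-text self-supervised (CLIP-style) image features via open_clip."""
    import open_clip
    model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
    model.eval()
    return lambda ims: encode(ims, model.encode_image, preprocess)


def probe_gap(encoder_fn, train_pairs, test_pairs):
    """Fit a linear probe on shortcut-correlated training features; a large
    train/test gap suggests the probe leaned on the shortcut, not the true signal."""
    Xtr = encoder_fn([im for im, _ in train_pairs])
    Xte = encoder_fn([im for im, _ in test_pairs])
    ytr, yte = [y for _, y in train_pairs], [y for _, y in test_pairs]
    clf = LogisticRegression(max_iter=2000).fit(Xtr, ytr)
    return clf.score(Xtr, ytr), clf.score(Xte, yte)


# Usage sketch (load_dataset is hypothetical):
#   train = inject_shortcut(load_dataset("train"), correlate=True)
#   test  = inject_shortcut(load_dataset("test"), correlate=False)
#   for name, enc in [("supervised", supervised_encoder()), ("image-text", clip_encoder())]:
#       print(name, "train/test accuracy:", probe_gap(enc, train, test))
```

A smaller train/test accuracy gap for the image-text encoder than for the supervised one would be consistent with the abstract's claim of reduced reliance on the visual shortcut.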
Related papers
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [99.9389737339175]
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions from unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning [57.4893889206655]
We introduce synthetic shortcuts for vision-language: a training and evaluation framework in which synthetic shortcut features are injected into image-text data (a minimal illustration of such injected shortcuts appears after this list).
We show that contrastive VLMs trained from scratch or fine-tuned with data containing these synthetic shortcuts mainly learn features that represent the shortcut.
arXiv Detail & Related papers (2024-02-27T13:50:34Z)
- No More Shortcuts: Realizing the Potential of Temporal Self-Supervision [69.59938105887538]
We propose a more challenging reformulation of temporal self-supervision as frame-level (rather than clip-level) recognition tasks.
We demonstrate experimentally that our more challenging frame-level task formulations and the removal of shortcuts drastically improve the quality of features learned through temporal self-supervision.
arXiv Detail & Related papers (2023-12-20T13:20:31Z)
- On the Foundations of Shortcut Learning [20.53986437152018]
We study how predictivity and availability interact to shape models' feature use.
We find that linear models are relatively unbiased, but introducing a single hidden layer with ReLU or Tanh units yields a bias.
arXiv Detail & Related papers (2023-10-24T22:54:05Z)
- Self-Supervised Multi-Object Tracking For Autonomous Driving From Consistency Across Timescales [53.55369862746357]
Self-supervised multi-object trackers have tremendous potential as they enable learning from raw domain-specific data.
However, their re-identification accuracy still falls short compared to their supervised counterparts.
We propose a training objective that enables self-supervised learning of re-identification features from multiple sequential frames.
arXiv Detail & Related papers (2023-04-25T20:47:29Z)
- Learning Transferable Pedestrian Representation from Multimodal Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on the LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive playback speed, exploiting the relative speed between two video clips as the supervisory label.
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
- Automatic Shortcut Removal for Self-Supervised Representation Learning [39.636691159890354]
In self-supervised visual representation learning, a feature extractor is trained on a "pretext task" for which labels can be generated cheaply, without human annotation.
However, the extractor can exploit low-level "shortcut" features that solve the pretext task without yielding broadly useful representations; much work has gone into identifying such shortcut features and hand-designing schemes to reduce their effect.
We show that shortcut features can instead be identified and removed automatically, across common pretext tasks and datasets, by training a "lens" network to make small image changes that maximally reduce performance on the pretext task.
arXiv Detail & Related papers (2020-02-20T16:00:18Z)
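
The "Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning" entry above proposes injecting synthetic shortcuts into image-text data; the snippet below is a minimal, hypothetical illustration of one such injection, in which a per-sample identifier is stamped into both the image and the caption so that a contrastive model can match pairs through the identifier rather than through semantics. The identifier format, its placement, and the evaluation idea in the closing comment are assumptions for illustration, not the authors' exact construction.

```python
# Illustrative sketch (not the authors' implementation) of injecting a synthetic
# shortcut into image-caption pairs before contrastive (CLIP-style) training:
# a per-sample ID is written into both modalities, giving the model a trivial
# way to match pairs without learning the underlying semantics.
from PIL import Image, ImageDraw


def inject_synthetic_shortcut(image: Image.Image, caption: str, sample_id: int):
    """Return an (image, caption) pair carrying a shared, semantically meaningless ID."""
    img = image.copy()
    draw = ImageDraw.Draw(img)
    draw.text((2, 2), f"#{sample_id:05d}", fill="white")   # visual shortcut: ID rendered in the corner
    txt = f"{caption} [id {sample_id:05d}]"                # textual shortcut: same ID appended to the caption
    return img, txt


# Evaluation idea sketched from the entry above: train or fine-tune a contrastive
# VLM on such pairs, then measure image-text retrieval with the IDs removed; a
# large drop suggests the model mainly learned the shortcut, not the content.
```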
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.