A Whac-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One
Amplifies Others
- URL: http://arxiv.org/abs/2212.04825v2
- Date: Tue, 21 Mar 2023 17:13:58 GMT
- Title: A Whac-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One
Amplifies Others
- Authors: Zhiheng Li, Ivan Evtimov, Albert Gordo, Caner Hazirbas, Tal Hassner,
Cristian Canton Ferrer, Chenliang Xu, Mark Ibrahim
- Abstract summary: Key to advancing the reliability of vision systems is understanding whether existing methods can overcome multiple shortcuts or struggle in a Whac-A-Mole game.
We find computer vision models, including large foundation models, struggle when multiple shortcuts are present.
We propose Last Layer Ensemble, a simple-yet-effective method to mitigate multiple shortcuts without Whac-A-Mole behavior.
- Score: 48.11387483887109
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Machine learning models have been found to learn shortcuts -- unintended
decision rules that are unable to generalize -- undermining models'
reliability. Previous works address this problem under the tenuous assumption
that only a single shortcut exists in the training data. Real-world images are
rife with multiple visual cues from background to texture. Key to advancing the
reliability of vision systems is understanding whether existing methods can
overcome multiple shortcuts or struggle in a Whac-A-Mole game, i.e., where
mitigating one shortcut amplifies reliance on others. To address this
shortcoming, we propose two benchmarks: 1) UrbanCars, a dataset with precisely
controlled spurious cues, and 2) ImageNet-W, an ImageNet-based evaluation set
for the watermark shortcut, which we discovered affects nearly every modern
vision model. Along with texture and background, ImageNet-W allows us to study
multiple shortcuts emerging from training on natural images. We find computer
vision models, including large foundation models -- regardless of training set,
architecture, and supervision -- struggle when multiple shortcuts are present.
Even methods explicitly designed to combat shortcuts struggle in a Whac-A-Mole
dilemma. To tackle this challenge, we propose Last Layer Ensemble, a
simple-yet-effective method to mitigate multiple shortcuts without Whac-A-Mole
behavior. Our results surface multi-shortcut mitigation as an overlooked
challenge critical to advancing the reliability of vision systems. The datasets
and code are released: https://github.com/facebookresearch/Whac-A-Mole.
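The abstract describes Last Layer Ensemble only at a high level, so below is a minimal, hedged sketch of the general idea: several classification heads share one frozen backbone, each head is trained on inputs transformed by an augmentation that targets a single shortcut cue (e.g., a watermark overlay), and inference averages the heads' logits. The class name, the head-augmentation pairing, and plain logit averaging are illustrative assumptions, not the paper's released implementation (see the repository above for that).

```python
# Hedged sketch of a "last layer ensemble": K linear heads over one shared
# (frozen) backbone, each head trained on inputs transformed by an
# augmentation targeting a single shortcut cue. Names and the plain
# logit-averaging inference rule are illustrative assumptions.
import torch
import torch.nn as nn

class LastLayerEnsemble(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int,
                 num_classes: int, augmentations: list):
        super().__init__()
        self.backbone = backbone            # shared feature extractor
        self.augmentations = augmentations  # one transform per shortcut cue
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in augmentations]
        )

    def training_logits(self, x: torch.Tensor, k: int) -> torch.Tensor:
        # Head k only ever sees inputs with its paired augmentation applied,
        # so its training signal targets one shortcut cue at a time.
        feats = self.backbone(self.augmentations[k](x))
        return self.heads[k](feats)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Inference on clean inputs: average the logits of all heads so that
        # no single head's shortcut sensitivity dominates the prediction.
        feats = self.backbone(x)
        return torch.stack([head(feats) for head in self.heads]).mean(dim=0)
```

In this sketch, each head k would be updated per batch with a standard cross-entropy loss on training_logits(x, k) while the backbone stays frozen; freezing is one way to express the intuition that mitigating one cue should not reshape shared features in a way that amplifies reliance on another.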
Related papers
- MOWA: Multiple-in-One Image Warping Model [65.73060159073644]
We propose a Multiple-in-One image warping model (named MOWA) in this work.
We mitigate the difficulty of multi-task learning by disentangling the motion estimation at both the region level and pixel level.
To our knowledge, this is the first work that solves multiple practical warping tasks in one single model.
arXiv Detail & Related papers (2024-04-16T16:50:35Z)
- Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts [14.610244867640471]
Recent vision-language models are driven by large-scale pretrained models.
We introduce a parameter-efficient method to address challenges such as overfitting, catastrophic forgetting, and the cross-modal gap between vision and language.
Our experiments on several video question answering benchmarks demonstrate the superiority of our approach in terms of performance and parameter efficiency.
arXiv Detail & Related papers (2023-09-27T18:00:09Z)
- Which Shortcut Solution Do Question Answering Models Prefer to Learn? [38.36299280464046]
Question answering (QA) models for reading comprehension tend to learn shortcut solutions rather than the solutions intended by QA datasets.
We show that shortcuts that exploit answer positions and word-label correlations are preferentially learned for extractive and multiple-choice QA.
We experimentally show that the learnability of shortcuts can be utilized to construct an effective QA training set.
arXiv Detail & Related papers (2022-11-29T13:57:59Z)
- Multi-Modal Few-Shot Temporal Action Detection [157.96194484236483]
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection to new classes.
We introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered a marriage of FS-TAD and ZS-TAD.
arXiv Detail & Related papers (2022-11-27T18:13:05Z)
- Clover: Towards A Unified Video-Language Alignment and Fusion Model [154.1070559563592]
We introduce Clover, a Correlated Video-Language pre-training method.
It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task.
Clover establishes new state-of-the-art results on multiple downstream tasks.
arXiv Detail & Related papers (2022-07-16T09:38:52Z)
- Self-Supervision on Images and Text Reduces Reliance on Visual Shortcut Features [0.0]
Shortcut features are inputs that are associated with the outcome of interest in the training data, but are either no longer associated or not present in testing or deployment settings.
We show that self-supervised models trained on images and text provide more robust image representations and reduce the model's reliance on visual shortcut features.
arXiv Detail & Related papers (2022-06-14T20:33:26Z)
- Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation [97.17517060585875]
We present a unified approach to visual navigation using a novel modular transfer learning model.
Our model can effectively leverage its experience from one source task and apply it to multiple target tasks.
Our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.
arXiv Detail & Related papers (2022-02-05T00:07:21Z)
- Why Machine Reading Comprehension Models Learn Shortcuts? [56.629192589376046]
We argue that a larger proportion of shortcut questions in the training data makes models rely excessively on shortcut tricks.
A thorough empirical analysis shows that MRC models tend to learn shortcut questions earlier than challenging questions.
arXiv Detail & Related papers (2021-06-02T08:43:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.