Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR
- URL: http://arxiv.org/abs/2405.16934v1
- Date: Mon, 27 May 2024 08:26:58 GMT
- Authors: Zhenyang Li, Yangyang Guo, Kejie Wang, Xiaolin Chen, Liqiang Nie, Mohan Kankanhalli
- Abstract summary: Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes.
Progress on the benchmark dataset stems largely from the recent advancement of Vision-Language Transformers (VL Transformers).
This paper posits that the VL Transformers do not exhibit visual commonsense, which is the key to VCR.
- Score: 51.72751335574947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes. To achieve this goal, a model is required to provide an acceptable rationale as the reason for the predicted answers. Progress on the benchmark dataset stems largely from the recent advancement of Vision-Language Transformers (VL Transformers). These models are first pre-trained on some generic large-scale vision-text datasets, and then the learned representations are transferred to the downstream VCR task. Despite their attractive performance, this paper posits that the VL Transformers do not exhibit visual commonsense, which is the key to VCR. In particular, our empirical results pinpoint several shortcomings of existing VL Transformers: small gains from pre-training, unexpected language bias, limited model architecture for the two inseparable sub-tasks, and neglect of the important object-tag correlation. With these findings, we tentatively suggest some future directions from the aspect of dataset, evaluation metric, and training tricks. We believe this work could make researchers revisit the intuition and goals of VCR, and thus help tackle the remaining challenges in visual reasoning.
Related papers
- Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects [31.926206783846144]
We show that a Vision Transformer (ViT) fails dramatically on most ARC tasks even when trained on one million examples per task.
We propose ViTARC, a ViT-style architecture that unlocks some of the visual reasoning capabilities required by the ARC.
Our task-specific ViTARC models achieve a test solve rate close to 100% on more than half of the 400 public ARC tasks.
arXiv Detail & Related papers (2024-10-08T22:25:34Z)
- Denoising Vision Transformers [43.03068202384091]
We propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT).
In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis.
In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision.
arXiv Detail & Related papers (2024-01-05T18:59:52Z)
- Visual Reasoning: from State to Transformation [80.32402545546209]
Existing visual reasoning tasks ignore an important factor, i.e., transformation.
We propose a novel transformation-driven visual reasoning (TVR) task.
We show that state-of-the-art visual reasoning models perform well on Basic, but are far from human-level intelligence on Event, View, and TRANCO.
arXiv Detail & Related papers (2023-05-02T14:24:12Z)
- VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers [47.581265194864585]
Internal mechanisms of vision and multimodal transformers remain largely opaque.
With the success of these transformers, it is increasingly critical to understand their inner workings.
We propose VL-InterpreT, which provides novel interactive visualizations for interpreting the attentions and hidden representations in multimodal transformers.
arXiv Detail & Related papers (2022-03-30T05:25:35Z)
- An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion (e.g., merged attention vs. co-
arXiv Detail & Related papers (2021-11-03T17:55:36Z)
- Visual Relationship Forecasting in Videos [56.122037294234865]
We present a new task named Visual Relationship Forecasting (VRF) in videos to explore the prediction of visual relationships in a manner of reasoning.
Given a subject-object pair with H existing frames, VRF aims to predict their future interactions for the next T frames without visual evidence.
To evaluate the VRF task, we introduce two video datasets named VRF-AG and VRF-VidOR, with a series of temporally localized visual relation annotations in a video.
arXiv Detail & Related papers (2021-07-02T16:43:19Z)
- Spatiotemporal Transformer for Video-based Person Re-identification [102.58619642363958]
We show that, despite the strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting.
We propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains.
The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks.
arXiv Detail & Related papers (2021-03-30T16:19:27Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)