VIPHY: Probing "Visible" Physical Commonsense Knowledge
- URL: http://arxiv.org/abs/2209.07000v1
- Date: Thu, 15 Sep 2022 02:06:25 GMT
- Title: VIPHY: Probing "Visible" Physical Commonsense Knowledge
- Authors: Shikhar Singh, Ehsan Qasemi, Muhao Chen
- Abstract summary: Vision-language models (VLMs) have shown remarkable performance on visual reasoning tasks.
We evaluate their ability to acquire "visible" physical knowledge.
Our results indicate a severe gap between model and human performance.
- Score: 22.00069189468524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, vision-language models (VLMs) have shown remarkable
performance on visual reasoning tasks (e.g. attributes, location). While such
tasks measure the requisite knowledge to ground and reason over a given visual
instance, they do not measure the ability of VLMs to retain and generalize such
knowledge. In this work, we evaluate their ability to acquire
"visible" physical knowledge -- the information that is easily accessible from
images of static scenes, particularly across the dimensions of object color,
size and space. We build an automatic pipeline to derive a comprehensive
knowledge resource for calibrating and probing these models. Our results
indicate a severe gap between model and human performance across all three
tasks. Furthermore, our caption-pretrained baseline (CapBERT) significantly
outperforms VLMs on both the size and spatial tasks, highlighting that despite
sufficient access to the visual modality for grounding language, VLMs struggle
to retain such knowledge. The dataset and code are available at
https://github.com/Axe--/ViPhy .
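To make the probing setup concrete, below is a minimal sketch of a zero-shot color probe in the spirit of ViPhy's evaluation. The prompt template, the candidate color set, and the use of bert-base-uncased (as a stand-in for a caption-pretrained model such as CapBERT) are illustrative assumptions rather than the paper's exact configuration; see the linked repository for the actual pipeline.

```python
# Minimal sketch of a zero-shot probe for "visible" color knowledge.
# Assumptions: the prompt template, the candidate color set, and
# bert-base-uncased (standing in for a caption-pretrained model like
# CapBERT) are illustrative, not the paper's exact setup.
from transformers import pipeline

COLORS = ["red", "orange", "yellow", "green", "blue", "brown", "black", "white"]

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def color_distribution(obj: str) -> dict:
    """Score each candidate color for the masked color slot of an object."""
    prompt = f"The color of the {obj} is [MASK]."
    preds = fill_mask(prompt, targets=COLORS, top_k=len(COLORS))
    # Each prediction carries the filled token string and its softmax score.
    return {p["token_str"]: p["score"] for p in preds}

if __name__ == "__main__":
    for obj in ["banana", "school bus", "crow"]:
        dist = color_distribution(obj)
        top = max(dist, key=dist.get)
        print(f"{obj}: top color = {top}, scores = {dist}")
```

The same pattern extends to the size and spatial dimensions by swapping the template and the candidate answer set.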
Related papers
- Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also to lack some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities [59.39858959066982]
Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics.
We develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images.
By training a VLM on such data, we significantly enhance its ability on both qualitative and quantitative spatial VQA.
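As a rough illustration of what such automatic generation can look like, here is a toy, template-based sketch that produces one qualitative and one quantitative spatial QA pair from 3D object centroids. The object records, coordinate convention, and question templates are made-up assumptions; the actual SpatialVLM framework derives the geometry from real images and operates at a vastly larger scale.

```python
# Toy illustration of template-based spatial VQA generation, loosely in the
# spirit of an automatic 3D spatial QA pipeline. Objects, coordinates, and
# templates below are invented for the example.
from dataclasses import dataclass

@dataclass
class Object3D:
    name: str
    x: float  # metres, +x to the viewer's right
    y: float  # metres, +y upward
    z: float  # metres, distance from the camera

def qualitative_qa(a: Object3D, b: Object3D):
    """Generate a left/right question and answer it from the 3D centroids."""
    question = f"Is the {a.name} to the left of the {b.name}?"
    answer = "yes" if a.x < b.x else "no"
    return question, answer

def quantitative_qa(a: Object3D, b: Object3D):
    """Generate a metric-distance question and answer it from the geometry."""
    dist = ((a.x - b.x) ** 2 + (a.y - b.y) ** 2 + (a.z - b.z) ** 2) ** 0.5
    question = f"How far is the {a.name} from the {b.name}?"
    answer = f"about {dist:.1f} metres"
    return question, answer

if __name__ == "__main__":
    mug = Object3D("mug", x=-0.3, y=0.0, z=1.2)
    laptop = Object3D("laptop", x=0.4, y=0.05, z=1.5)
    print(qualitative_qa(mug, laptop))
    print(quantitative_qa(mug, laptop))
```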
arXiv Detail & Related papers (2024-01-22T18:01:01Z) - AffordanceLLM: Grounding Affordance from Vision Language Models [36.97072698640563]
Affordance grounding refers to the task of finding the area of an object with which one can interact.
Much of the required knowledge is hidden and lies beyond the image content and the supervised labels of a limited training set.
We attempt to improve the generalization capability of current affordance grounding methods by taking advantage of rich world, abstract, and human-object-interaction knowledge.
arXiv Detail & Related papers (2024-01-12T03:21:02Z)
- Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning [23.33600235294496]
Vision-Language Models (VLMs) are expected to be capable of reasoning with commonsense knowledge like human beings.
This makes us wonder whether, based on visual cues, VLMs can match or even outperform humans in reasoning about times and locations.
We propose a two-stage recognition and reasoning probing task, applied to both discriminative and generative VLMs.
arXiv Detail & Related papers (2023-07-12T13:46:28Z)
- Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation [68.13453771001522]
We propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
arXiv Detail & Related papers (2023-06-14T13:07:48Z)
- Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer [61.34424171458634]
We study whether integrating visual knowledge into a language model can fill the gap.
Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings.
arXiv Detail & Related papers (2022-03-14T22:02:40Z)
- Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge [59.87823082513752]
This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
arXiv Detail & Related papers (2021-01-15T08:37:55Z)
- What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions [50.435861435121915]
We use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations.
Our experiments show that our "muscly-supervised" representation outperforms MoCo, a state-of-the-art visual-only method.
arXiv Detail & Related papers (2020-10-16T17:46:53Z)