Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times
and Location Reasoning
- URL: http://arxiv.org/abs/2307.06166v2
- Date: Fri, 29 Dec 2023 16:08:25 GMT
- Title: Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning
- Authors: Gengyuan Zhang, Yurui Zhang, Kerui Zhang, Volker Tresp
- Abstract summary: Vision-Language Models (VLMs) are expected to be capable of reasoning with commonsense knowledge as humans do.
This makes us wonder whether, based on visual cues, Vision-Language Models can match or even outperform human capability in reasoning about time and location.
We propose a two-stage recognition and reasoning probing task, applied to discriminative and generative VLMs.
- Score: 23.33600235294496
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) are expected to be capable of reasoning with
commonsense knowledge as human beings. One example is that humans can reason
where and when an image is taken based on their knowledge. This makes us wonder
if, based on visual cues, Vision-Language Models that are pre-trained with
large-scale image-text resources can match or even outperform human
capability in reasoning about time and location. To address this question, we
propose a two-stage recognition and reasoning probing task, applied to
discriminative and generative VLMs to uncover whether VLMs can recognize
time- and location-relevant features and further reason about them. To
facilitate the investigation, we introduce WikiTiLo, a well-curated image
dataset comprising images with rich socio-cultural cues. In the extensive
experimental studies, we find that although VLMs can effectively retain
relevant features in their visual encoders, they still fail to reason
perfectly. We will release our dataset and code to facilitate future studies.
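The abstract's two-stage probing idea can be illustrated with a minimal sketch. Everything below is hypothetical: the `recognize`/`reason` functions, the stub lookup-table "model", and the toy samples are illustrative stand-ins, not the paper's actual protocol or API. The sketch only shows the shape of the evaluation: stage 1 scores whether time/location-relevant cues are surfaced, stage 2 scores whether those cues are reasoned into a final answer.

```python
# Hypothetical sketch of a two-stage "recognition then reasoning" probe.
# Stage 1 checks whether a model surfaces time/location-relevant cues;
# stage 2 checks whether it can reason from those cues to an answer.
# The "model" here is a stub lookup table standing in for a real VLM.

from dataclasses import dataclass

@dataclass
class Sample:
    image_id: str
    features: set  # gold time/location cues visible in the image
    answer: str    # gold answer, e.g. a decade or a country

def recognize(model, image_id):
    """Stage 1: which cues does the model detect in the image?"""
    return model["recognition"].get(image_id, set())

def reason(model, cues):
    """Stage 2: what time/location does the model infer from the cues?"""
    return model["reasoning"].get(frozenset(cues), "unknown")

def probe(model, samples):
    rec_hits = sum(recognize(model, s.image_id) == s.features for s in samples)
    rea_hits = sum(reason(model, recognize(model, s.image_id)) == s.answer
                   for s in samples)
    n = len(samples)
    return {"recognition_acc": rec_hits / n, "reasoning_acc": rea_hits / n}

# Toy data: the stub recognizes cues correctly but reasons imperfectly,
# mirroring the finding that features are retained yet reasoning falls short.
samples = [
    Sample("img1", {"red phone box"}, "UK"),
    Sample("img2", {"CRT television"}, "1980s"),
]
model = {
    "recognition": {"img1": {"red phone box"}, "img2": {"CRT television"}},
    "reasoning": {frozenset({"red phone box"}): "UK"},  # misses the 2nd case
}
print(probe(model, samples))  # {'recognition_acc': 1.0, 'reasoning_acc': 0.5}
```

Splitting the probe this way isolates where a model fails: perfect recognition with imperfect reasoning points at the reasoning head rather than the visual encoder.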
Related papers
- See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding [78.88461026069862]
Vision-language models (VLMs) can respond to queries about images in many languages.
We present a novel investigation that demonstrates and localizes Western bias in image understanding.
arXiv Detail & Related papers (2024-06-17T15:49:51Z)
- ReMI: A Dataset for Reasoning with Multiple Images [41.954830849939526]
We introduce ReMI, a dataset designed to assess large language models' ability to Reason with Multiple Images.
This dataset encompasses a diverse range of tasks, spanning various reasoning domains such as math, physics, logic, code, table/chart understanding, and spatial and temporal reasoning.
We have benchmarked several cutting-edge LLMs and found a substantial gap between their performance and human-level proficiency.
arXiv Detail & Related papers (2024-06-13T14:37:04Z) - An Introduction to Vision-Language Modeling [128.6223984157515]
Vision-language model (VLM) applications will significantly impact our relationship with technology.
We introduce what VLMs are, how they work, and how to train them.
Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
arXiv Detail & Related papers (2024-05-27T15:01:23Z) - IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models [21.589318022339317]
We present IllusionVQA: a dataset of challenging optical illusions and hard-to-interpret scenes.
Human evaluation reveals that humans achieve 91.03% and 100% accuracy in comprehension and localization, respectively.
arXiv Detail & Related papers (2024-03-23T23:06:32Z)
- CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding [66.52659447360104]
We propose CoVLM, which can guide the LLM to explicitly compose visual entities and relationships in the text.
arXiv Detail & Related papers (2023-11-06T18:59:44Z)
- Large Language Models are Visual Reasoning Coordinators [144.67558375045755]
We propose a novel paradigm that coordinates multiple vision-language models for visual reasoning.
We show that our instruction tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering.
We also show that our in-context learning variant, Cola-Zero, exhibits competitive performance in zero and few-shot settings.
arXiv Detail & Related papers (2023-10-23T17:59:31Z)
- Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models [3.86170450233149]
We show that large vision-and-language models (VLMs) trained to match images with text lack fine-grained understanding of spatial relations.
We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses.
arXiv Detail & Related papers (2023-08-18T18:58:54Z)
- See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages, see, think and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
arXiv Detail & Related papers (2023-01-12T18:59:50Z)
- VIPHY: Probing "Visible" Physical Commonsense Knowledge [22.00069189468524]
Vision-language models (VLMs) have shown remarkable performance on visual reasoning tasks.
We evaluate their ability to acquire "visible" physical knowledge.
Our results indicate a severe gap between model and human performance.
arXiv Detail & Related papers (2022-09-15T02:06:25Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.