Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning
- URL: http://arxiv.org/abs/2307.06166v2
- Date: Fri, 29 Dec 2023 16:08:25 GMT
- Title: Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning
- Authors: Gengyuan Zhang, Yurui Zhang, Kerui Zhang, Volker Tresp
- Abstract summary: Vision-Language Models (VLMs) are expected to be capable of reasoning with commonsense knowledge as human beings do.
This makes us wonder whether, based on visual cues, Vision-Language Models can match or even outperform human capability in reasoning about times and location.
We propose a two-stage recognition and reasoning probing task, applied to both discriminative and generative VLMs.
- Score: 23.33600235294496
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) are expected to be capable of reasoning with
commonsense knowledge as human beings do. One example is that humans can reason
where and when an image is taken based on their knowledge. This makes us wonder
if, based on visual cues, Vision-Language Models that are pre-trained with
large-scale image-text resources can match or even outperform human
capability in reasoning about times and location. To address this question, we
propose a two-stage recognition and reasoning probing task, applied to
discriminative and generative VLMs, to uncover whether VLMs can recognize
times- and location-relevant features and further reason about them. To
facilitate the investigation, we introduce WikiTiLo, a well-curated image
dataset comprising images with rich socio-cultural cues. In the extensive
experimental studies, we find that although VLMs can effectively retain
relevant features in their visual encoders, they still fall short of perfect
reasoning. We will release our dataset and code to facilitate future studies.
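As an illustration of what the recognition stage of such a probe could look like for a discriminative VLM, the sketch below scores an image against country and decade prompts with an off-the-shelf CLIP model via zero-shot classification; the prompt templates, label sets, and image path are illustrative assumptions, not the WikiTiLo protocol or the authors' released code.

```python
# Minimal sketch (not the authors' code): zero-shot probing of a CLIP-style
# discriminative VLM for location and time recognition. Label sets, prompt
# templates, and the image path are hypothetical placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def probe(image, labels, template):
    """Return the label whose prompt best matches the image, plus probabilities."""
    prompts = [template.format(label) for label in labels]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_labels)
    probs = logits.softmax(dim=-1).squeeze(0)
    return labels[int(probs.argmax())], probs

image = Image.open("example.jpg")                  # hypothetical input image
countries = ["Japan", "Italy", "Brazil", "Egypt"]  # assumed label set
decades = ["1960s", "1980s", "2000s", "2020s"]     # assumed label set
print(probe(image, countries, "a photo taken in {}"))
print(probe(image, decades, "a photo taken in the {}"))
```

A generative VLM would be probed analogously by asking a question such as "In which country was this photo taken?" and matching the generated answer to the label set; the reasoning stage would then pose follow-up questions that require combining the recognized cues.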
Related papers
- Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- How Well Can Vision Language Models See Image Details? [53.036922527685064]
We introduce a pixel value prediction task to explore "How Well Can Vision Language Models See Image Details?"
Our research reveals that incorporating pixel value prediction as one of the VLM pre-training tasks and vision encoder adaptation markedly boosts VLM performance on downstream image-language understanding tasks.
arXiv Detail & Related papers (2024-08-07T17:59:40Z)
- ReMI: A Dataset for Reasoning with Multiple Images [41.954830849939526]
We introduce ReMI, a dataset designed to assess large language models' ability to Reason with Multiple Images.
This dataset encompasses a diverse range of tasks, spanning various reasoning domains such as math, physics, logic, code, table/chart understanding, and spatial and temporal reasoning.
We have benchmarked several cutting-edge LLMs and found a substantial gap between their performance and human-level proficiency.
arXiv Detail & Related papers (2024-06-13T14:37:04Z)
- An Introduction to Vision-Language Modeling [128.6223984157515]
Vision-language model (VLM) applications will significantly impact our relationship with technology.
We introduce what VLMs are, how they work, and how to train them.
Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
arXiv Detail & Related papers (2024-05-27T15:01:23Z)
- Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models [71.93366651585275]
Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks.
We propose Visualization-of-Thought (VoT) to elicit spatial reasoning of LLMs by visualizing their reasoning traces.
VoT significantly enhances the spatial reasoning abilities of LLMs.
arXiv Detail & Related papers (2024-04-04T17:45:08Z)
- IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models [21.589318022339317]
We present IllusionVQA: a dataset of challenging optical illusions and hard-to-interpret scenes.
Human evaluation reveals that humans achieve 91.03% and 100% accuracy in comprehension and localization, respectively.
arXiv Detail & Related papers (2024-03-23T23:06:32Z)
- CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding [66.52659447360104]
We propose CoVLM, which guides the LLM to explicitly compose visual entities and relationships in the text.
arXiv Detail & Related papers (2023-11-06T18:59:44Z)
- See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages: see, think, and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
arXiv Detail & Related papers (2023-01-12T18:59:50Z)
- VIPHY: Probing "Visible" Physical Commonsense Knowledge [22.00069189468524]
Vision-language models (VLMs) have shown remarkable performance on visual reasoning tasks.
We evaluate their ability to acquire "visible" physical knowledge.
Our results indicate a severe gap between model and human performance.
arXiv Detail & Related papers (2022-09-15T02:06:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.