Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times
and Location Reasoning
- URL: http://arxiv.org/abs/2307.06166v2
- Date: Fri, 29 Dec 2023 16:08:25 GMT
- Title: Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning
- Authors: Gengyuan Zhang, Yurui Zhang, Kerui Zhang, Volker Tresp
- Abstract summary: Vision-Language Models (VLMs) are expected to be capable of reasoning with commonsense knowledge as humans do.
This makes us wonder whether, based on visual cues, Vision-Language Models can match or even outperform human capability in reasoning about time and location.
We propose a two-stage recognition and reasoning probing task, applied to discriminative and generative VLMs.
- Score: 23.33600235294496
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) are expected to be capable of reasoning with
commonsense knowledge as human beings. One example is that humans can reason
where and when an image is taken based on their knowledge. This makes us wonder
if, based on visual cues, Vision-Language Models that are pre-trained with
large-scale image-text resources can match or even outperform human
capability in reasoning about time and location. To address this question, we
propose a two-stage recognition and reasoning probing task, applied to
discriminative and generative VLMs to uncover whether VLMs can recognize
time- and location-relevant features and further reason about them. To
facilitate the investigation, we introduce WikiTiLo, a well-curated image
dataset comprising images with rich socio-cultural cues. In the extensive
experimental studies, we find that although VLMs can effectively retain
relevant features in their visual encoders, they still fail to reason
perfectly. We will release our dataset and code to facilitate future studies.
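The abstract's two-stage probing idea can be illustrated with a minimal sketch. Everything below is hypothetical: the `recognize`/`reason` functions, the stub lookup-table "model", and the toy samples are illustrative stand-ins, not the paper's actual protocol or API. The sketch only shows the shape of the evaluation: stage 1 scores whether time/location-relevant cues are surfaced, stage 2 scores whether those cues are reasoned into a final answer.

```python
# Hypothetical sketch of a two-stage "recognition then reasoning" probe.
# Stage 1 checks whether a model surfaces time/location-relevant cues;
# stage 2 checks whether it can reason from those cues to an answer.
# The "model" here is a stub lookup table standing in for a real VLM.

from dataclasses import dataclass

@dataclass
class Sample:
    image_id: str
    features: set  # gold time/location cues visible in the image
    answer: str    # gold answer, e.g. a decade or a country

def recognize(model, image_id):
    """Stage 1: which cues does the model detect in the image?"""
    return model["recognition"].get(image_id, set())

def reason(model, cues):
    """Stage 2: what time/location does the model infer from the cues?"""
    return model["reasoning"].get(frozenset(cues), "unknown")

def probe(model, samples):
    rec_hits = sum(recognize(model, s.image_id) == s.features for s in samples)
    rea_hits = sum(reason(model, recognize(model, s.image_id)) == s.answer
                   for s in samples)
    n = len(samples)
    return {"recognition_acc": rec_hits / n, "reasoning_acc": rea_hits / n}

# Toy data: the stub recognizes cues correctly but reasons imperfectly,
# mirroring the finding that features are retained yet reasoning falls short.
samples = [
    Sample("img1", {"red phone box"}, "UK"),
    Sample("img2", {"CRT television"}, "1980s"),
]
model = {
    "recognition": {"img1": {"red phone box"}, "img2": {"CRT television"}},
    "reasoning": {frozenset({"red phone box"}): "UK"},  # misses the 2nd case
}
print(probe(model, samples))  # {'recognition_acc': 1.0, 'reasoning_acc': 0.5}
```

Splitting the probe this way isolates where a model fails: perfect recognition with imperfect reasoning points at the reasoning head rather than the visual encoder.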
Related papers
- See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding [78.88461026069862]
Vision-language models (VLMs) can respond to queries about images in many languages.
We present a novel investigation that demonstrates and localizes Western bias in image understanding.
arXiv Detail & Related papers (2024-06-17T15:49:51Z)
- ReMI: A Dataset for Reasoning with Multiple Images [41.954830849939526]
We introduce ReMI, a dataset designed to assess large language models' ability to Reason with Multiple Images.
This dataset encompasses a diverse range of tasks, spanning various reasoning domains such as math, physics, logic, code, table/chart understanding, and spatial and temporal reasoning.
We have benchmarked several cutting-edge LLMs and found a substantial gap between their performance and human-level proficiency.
arXiv Detail & Related papers (2024-06-13T14:37:04Z) - An Introduction to Vision-Language Modeling [128.6223984157515]
Vision-language model (VLM) applications will significantly impact our relationship with technology.
We introduce what VLMs are, how they work, and how to train them.
Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
arXiv Detail & Related papers (2024-05-27T15:01:23Z) - IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models [21.589318022339317]
We present IllusionVQA: a dataset of challenging optical illusions and hard-to-interpret scenes.
Human evaluation reveals that humans achieve 91.03% and 100% accuracy in comprehension and localization, respectively.
arXiv Detail & Related papers (2024-03-23T23:06:32Z)
- CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding [66.52659447360104]
We propose CoVLM, which can guide the LLM to explicitly compose visual entities and relationships in the text.
arXiv Detail & Related papers (2023-11-06T18:59:44Z)
- Large Language Models are Visual Reasoning Coordinators [144.67558375045755]
We propose a novel paradigm that coordinates multiple vision-language models for visual reasoning.
We show that our instruction tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering.
We also show that our in-context learning variant, Cola-Zero, exhibits competitive performance in zero and few-shot settings.
arXiv Detail & Related papers (2023-10-23T17:59:31Z)
- Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models [3.86170450233149]
We show that large vision-and-language models (VLMs) trained to match images with text lack fine-grained understanding of spatial relations.
We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses.
arXiv Detail & Related papers (2023-08-18T18:58:54Z)
- See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages, see, think and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
arXiv Detail & Related papers (2023-01-12T18:59:50Z)
- VIPHY: Probing "Visible" Physical Commonsense Knowledge [22.00069189468524]
Vision-language models (VLMs) have shown remarkable performance on visual reasoning tasks.
We evaluate their ability to acquire "visible" physical knowledge.
Our results indicate a severe gap between model and human performance.
arXiv Detail & Related papers (2022-09-15T02:06:25Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.