Does Spatial Cognition Emerge in Frontier Models?
- URL: http://arxiv.org/abs/2410.06468v1
- Date: Wed, 9 Oct 2024 01:41:49 GMT
- Title: Does Spatial Cognition Emerge in Frontier Models?
- Authors: Santhosh Kumar Ramakrishnan, Erik Wijmans, Philipp Kraehenbuehl, Vladlen Koltun,
- Abstract summary: We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models.
Results suggest that contemporary frontier models fall short of the spatial intelligence of animals.
- Score: 56.47912101304053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Not yet. We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models. Our benchmark builds on decades of research in cognitive science. It evaluates large-scale mapping abilities that are brought to bear when an organism traverses physical environments, smaller-scale reasoning about object shapes and layouts, and cognitive infrastructure such as spatial attention and memory. For many tasks, we instantiate parallel presentations via text and images, allowing us to benchmark both large language models and large multimodal models. Results suggest that contemporary frontier models fall short of the spatial intelligence of animals, performing near chance level on a number of classic tests of animal cognition.
Related papers
- Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces [34.809309396448654]
We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs.
We find that Multimodal Large Language Models (MLLMs) exhibit competitive - though subhuman - visual-spatial intelligence.
arXiv Detail & Related papers (2024-12-18T18:59:54Z) - SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation [7.659514491338669]
Current vision-language models may grasp basic spatial cues but struggle with the multi-dimensional spatial reasoning necessary for human-like understanding and real-world applications.
We develop SPHERE, a hierarchical evaluation framework supported by a new human-annotated dataset.
Benchmark evaluation of state-of-the-art models reveals significant deficiencies, especially in reasoning about distance and proximity.
These findings expose critical blind spots in existing models and underscore the need for more advanced spatial reasoning techniques.
arXiv Detail & Related papers (2024-12-17T09:10:55Z) - Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction [60.964512894143475]
We present Generative Spatial Transformer ( GST), a novel auto-regressive framework that jointly addresses spatial localization and view prediction.
Our model simultaneously estimates the camera pose from a single image and predicts the view from a new camera pose, effectively bridging the gap between spatial awareness and visual prediction.
arXiv Detail & Related papers (2024-10-24T17:58:05Z) - Exploring Spatial Schema Intuitions in Large Language and Vision Models [8.944921398608063]
We investigate whether large language models (LLMs) effectively capture implicit human intuitions about building blocks of language.
Surprisingly, correlations between model outputs and human responses emerge, revealing adaptability without a tangible connection to embodied experiences.
This research contributes to a nuanced understanding of the interplay between language, spatial experiences, and computations made by large language models.
arXiv Detail & Related papers (2024-02-01T19:25:50Z) - Generative Models as a Complex Systems Science: How can we make sense of
large language model behavior? [75.79305790453654]
Coaxing out desired behavior from pretrained models, while avoiding undesirable ones, has redefined NLP.
We argue for a systematic effort to decompose language model behavior into categories that explain cross-task performance.
arXiv Detail & Related papers (2023-07-31T22:58:41Z) - Things not Written in Text: Exploring Spatial Commonsense from Visual
Signals [77.46233234061758]
We investigate whether models with visual signals learn more spatial commonsense than text-based models.
We propose a benchmark that focuses on the relative scales of objects, and the positional relationship between people and objects under different actions.
We find that image synthesis models are more capable of learning accurate and consistent spatial knowledge than other models.
arXiv Detail & Related papers (2022-03-15T17:02:30Z) - The Right Spin: Learning Object Motion from Rotation-Compensated Flow
Fields [61.664963331203666]
How humans perceive moving objects is a longstanding research question in computer vision.
One approach to the problem is to teach a deep network to model all of these effects.
We present a novel probabilistic model to estimate the camera's rotation given the motion field.
arXiv Detail & Related papers (2022-02-28T22:05:09Z) - Self-supervised Secondary Landmark Detection via 3D Representation
Learning [13.157012771922801]
We present a method to learn the spatial relationship of the primary and secondary landmarks in three dimensional space.
This learning can be applied to various multiview settings across diverse organisms, including macaques, flies, and humans.
arXiv Detail & Related papers (2021-10-01T17:15:47Z) - VisualEchoes: Spatial Image Representation Learning through Echolocation [97.23789910400387]
Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation.
We propose a novel interaction-based representation learning framework that learns useful visual features via echolocation.
Our work opens a new path for representation learning for embodied agents, where supervision comes from interacting with the physical world.
arXiv Detail & Related papers (2020-05-04T16:16:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.