Visual cognition in multimodal large language models
- URL: http://arxiv.org/abs/2311.16093v2
- Date: Wed, 24 Jan 2024 11:03:49 GMT
- Title: Visual cognition in multimodal large language models
- Authors: Luca M. Schulze Buschoff, Elif Akata, Matthias Bethge, Eric Schulz
- Abstract summary: This paper evaluates the current state of vision-based large language models in the domains of intuitive physics, causal reasoning, and intuitive psychology.
Our findings reveal that, while these models demonstrate a notable proficiency in processing and interpreting visual data, they still fall short of human capabilities in these areas.
- Score: 13.768104721550321
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A chief goal of artificial intelligence is to build machines that think like
people. Yet it has been argued that deep neural network architectures fail to
accomplish this. Researchers have asserted these models' limitations in the
domains of causal reasoning, intuitive physics, and intuitive psychology. Yet
recent advancements, namely the rise of large language models, particularly
those designed for visual processing, have rekindled interest in the potential
to emulate human-like cognitive abilities. This paper evaluates the current
state of vision-based large language models in the domains of intuitive
physics, causal reasoning, and intuitive psychology. Through a series of
controlled experiments, we investigate the extent to which these modern models
grasp complex physical interactions, causal relationships, and intuitive
understanding of others' preferences. Our findings reveal that, while these
models demonstrate a notable proficiency in processing and interpreting visual
data, they still fall short of human capabilities in these areas. The models
exhibit a rudimentary understanding of physical laws and causal relationships,
but their performance is hindered by a lack of deeper insights - a key aspect
of human cognition. Furthermore, in tasks requiring an intuitive theory of
mind, the models fail altogether. Our results emphasize the need for
integrating more robust mechanisms for understanding causality, physical
dynamics, and social cognition into modern-day, vision-based language models,
and point out the importance of cognitively-inspired benchmarks.
Related papers
- Visual Knowledge in the Big Model Era: Retrospect and Prospect [63.282425615863]
Visual knowledge is a new form of knowledge representation that can encapsulate visual concepts and their relations in a succinct, comprehensive, and interpretable manner.
As the knowledge about the visual world has been identified as an indispensable component of human cognition and intelligence, visual knowledge is poised to have a pivotal role in establishing machine intelligence.
arXiv Detail & Related papers (2024-04-05T07:31:24Z)
- A Neuro-mimetic Realization of the Common Model of Cognition via Hebbian Learning and Free Energy Minimization [55.11642177631929]
Large neural generative models are capable of synthesizing semantically rich passages of text or producing complex images.
We discuss the COGnitive Neural GENerative system, an architecture that implements the Common Model of Cognition.
arXiv Detail & Related papers (2023-10-14T23:28:48Z)
- Eight challenges in developing theory of intelligence [3.0349733976070024]
A good theory of mathematical beauty is more practical than any current observation, as new predictions of physical reality can be verified self-consistently.
Here, we shed light on eight challenges in developing a theory of intelligence following this theoretical paradigm.
arXiv Detail & Related papers (2023-06-20T01:45:42Z)
- Turning large language models into cognitive models [0.0]
We show that large language models can be turned into cognitive models.
These models offer accurate representations of human behavior, even outperforming traditional cognitive models in two decision-making domains.
Taken together, these results suggest that large, pre-trained models can be adapted to become generalist cognitive models.
arXiv Detail & Related papers (2023-06-06T18:00:01Z)
- Causal Reasoning Meets Visual Representation Learning: A Prospective Study [117.08431221482638]
Lack of interpretability, robustness, and out-of-distribution generalization are becoming challenges for existing visual models.
Inspired by the strong inference ability of human-level agents, recent years have witnessed great effort in developing causal reasoning paradigms.
This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussion, and bring to the forefront the urgency of developing novel causal reasoning methods.
arXiv Detail & Related papers (2022-04-26T02:22:28Z)
- A Benchmark for Modeling Violation-of-Expectation in Physical Reasoning Across Event Categories [4.4920673251997885]
Violation-of-Expectation (VoE) is used to label scenes as either expected or surprising with knowledge of only expected scenes.
Existing VoE-based 3D datasets in physical reasoning provide mainly vision data with little to no ground truths or inductive biases.
We set up a benchmark to study physical reasoning by curating a novel large-scale synthetic 3D VoE dataset armed with ground-truth labels of causally relevant features and rules.
arXiv Detail & Related papers (2021-11-16T22:59:25Z)
- WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model [74.4875156387271]
We develop a novel foundation model pre-trained on huge multimodal (visual and textual) data.
We show that state-of-the-art results can be obtained on a wide range of downstream tasks.
arXiv Detail & Related papers (2021-10-27T12:25:21Z)
- AGENT: A Benchmark for Core Psychological Reasoning [60.35621718321559]
Intuitive psychology is the ability to reason about hidden mental variables that drive observable actions.
Despite recent interest in machine agents that reason about other agents, it is not clear if such agents learn or hold the core psychology principles that drive human reasoning.
We present a benchmark consisting of procedurally generated 3D animations, AGENT, structured around four scenarios.
arXiv Detail & Related papers (2021-02-24T14:58:23Z)
- Causal World Models by Unsupervised Deconfounding of Physical Dynamics [20.447000858907646]
The capability of imagining internally with a mental model of the world is vitally important for human cognition.
We propose Causal World Models (CWMs) that allow unsupervised modeling of relationships between the intervened and alternative futures.
We show reductions in sample complexity for reinforcement learning tasks and improvements in counterfactual physical reasoning.
arXiv Detail & Related papers (2020-12-28T13:44:36Z)
- Machine Common Sense [77.34726150561087]
Machine common sense remains a broad, potentially unbounded problem in artificial intelligence (AI).
This article deals with aspects of modeling commonsense reasoning, focusing on domains such as interpersonal interactions.
arXiv Detail & Related papers (2020-06-15T13:59:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.