Visual cognition in multimodal large language models
- URL: http://arxiv.org/abs/2311.16093v2
- Date: Wed, 24 Jan 2024 11:03:49 GMT
- Title: Visual cognition in multimodal large language models
- Authors: Luca M. Schulze Buschoff, Elif Akata, Matthias Bethge, Eric Schulz
- Abstract summary: This paper evaluates the current state of vision-based large language models in the domains of intuitive physics, causal reasoning, and intuitive psychology.
Our findings reveal that, while these models demonstrate a notable proficiency in processing and interpreting visual data, they still fall short of human capabilities in these areas.
- Score: 13.768104721550321
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A chief goal of artificial intelligence is to build machines that think like
people. Yet it has been argued that deep neural network architectures fail to
accomplish this. Researchers have asserted these models' limitations in the
domains of causal reasoning, intuitive physics, and intuitive psychology. Yet
recent advancements, namely the rise of large language models, particularly
those designed for visual processing, have rekindled interest in the potential
to emulate human-like cognitive abilities. This paper evaluates the current
state of vision-based large language models in the domains of intuitive
physics, causal reasoning, and intuitive psychology. Through a series of
controlled experiments, we investigate the extent to which these modern models
grasp complex physical interactions, causal relationships, and intuitive
understanding of others' preferences. Our findings reveal that, while these
models demonstrate a notable proficiency in processing and interpreting visual
data, they still fall short of human capabilities in these areas. The models
exhibit a rudimentary understanding of physical laws and causal relationships,
but their performance is hindered by a lack of deeper insights - a key aspect
of human cognition. Furthermore, in tasks requiring an intuitive theory of
mind, the models fail altogether. Our results emphasize the need for
integrating more robust mechanisms for understanding causality, physical
dynamics, and social cognition into modern-day, vision-based language models,
and point out the importance of cognitively-inspired benchmarks.
Related papers
- Visual Knowledge in the Big Model Era: Retrospect and Prospect [63.282425615863]
Visual knowledge is a new form of knowledge representation that can encapsulate visual concepts and their relations in a succinct, comprehensive, and interpretable manner.
As the knowledge about the visual world has been identified as an indispensable component of human cognition and intelligence, visual knowledge is poised to have a pivotal role in establishing machine intelligence.
arXiv Detail & Related papers (2024-04-05T07:31:24Z)
- A Neuro-mimetic Realization of the Common Model of Cognition via Hebbian Learning and Free Energy Minimization [55.11642177631929]
Large neural generative models are capable of synthesizing semantically rich passages of text or producing complex images.
We discuss the COGnitive Neural GENerative system, an architecture that implements the Common Model of Cognition.
arXiv Detail & Related papers (2023-10-14T23:28:48Z)
- Eight challenges in developing theory of intelligence [3.0349733976070024]
A good theory of mathematical beauty is more practical than any current observation, as new predictions of physical reality can be verified self-consistently.
Here, we shed light on eight challenges in developing a theory of intelligence following this theoretical paradigm.
arXiv Detail & Related papers (2023-06-20T01:45:42Z)
- Turning large language models into cognitive models [0.0]
We show that large language models can be turned into cognitive models.
These models offer accurate representations of human behavior, even outperforming traditional cognitive models in two decision-making domains.
Taken together, these results suggest that large, pre-trained models can be adapted to become generalist cognitive models.
arXiv Detail & Related papers (2023-06-06T18:00:01Z)
- Causal Reasoning Meets Visual Representation Learning: A Prospective Study [117.08431221482638]
Lack of interpretability, robustness, and out-of-distribution generalization are becoming challenges for existing visual models.
Inspired by the strong inference ability of human-level agents, recent years have witnessed great effort in developing causal reasoning paradigms.
This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussion, and bring to the forefront the urgency of developing novel causal reasoning methods.
arXiv Detail & Related papers (2022-04-26T02:22:28Z)
- A Benchmark for Modeling Violation-of-Expectation in Physical Reasoning Across Event Categories [4.4920673251997885]
Violation-of-Expectation (VoE) is used to label scenes as either expected or surprising with knowledge of only expected scenes.
Existing VoE-based 3D datasets in physical reasoning provide mainly vision data with little to no ground truths or inductive biases.
We set up a benchmark to study physical reasoning by curating a novel large-scale synthetic 3D VoE dataset armed with ground-truth labels of causally relevant features and rules.
arXiv Detail & Related papers (2021-11-16T22:59:25Z)
- WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model [74.4875156387271]
We develop a novel foundation model pre-trained on huge multimodal (visual and textual) data.
We show that state-of-the-art results can be obtained on a wide range of downstream tasks.
arXiv Detail & Related papers (2021-10-27T12:25:21Z)
- AGENT: A Benchmark for Core Psychological Reasoning [60.35621718321559]
Intuitive psychology is the ability to reason about hidden mental variables that drive observable actions.
Despite recent interest in machine agents that reason about other agents, it is not clear if such agents learn or hold the core psychology principles that drive human reasoning.
We present a benchmark consisting of procedurally generated 3D animations, AGENT, structured around four scenarios.
arXiv Detail & Related papers (2021-02-24T14:58:23Z)
- Causal World Models by Unsupervised Deconfounding of Physical Dynamics [20.447000858907646]
The capability of imagining internally with a mental model of the world is vitally important for human cognition.
We propose Causal World Models (CWMs) that allow unsupervised modeling of relationships between the intervened and alternative futures.
We show reductions in sample complexity for reinforcement learning tasks and improvements in counterfactual physical reasoning.
arXiv Detail & Related papers (2020-12-28T13:44:36Z)
- Machine Common Sense [77.34726150561087]
Machine common sense remains a broad, potentially unbounded problem in artificial intelligence (AI).
This article deals with aspects of modeling commonsense reasoning, focusing on domains such as interpersonal interactions.
arXiv Detail & Related papers (2020-06-15T13:59:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.