BDIQA: A New Dataset for Video Question Answering to Explore Cognitive
Reasoning through Theory of Mind
- URL: http://arxiv.org/abs/2402.07402v1
- Date: Mon, 12 Feb 2024 04:34:19 GMT
- Authors: Yuanyuan Mao, Xin Lin, Qin Ni, Liang He
- Abstract summary: Theory of mind (ToM) can make AI more closely resemble human thought processes.
Video question answering (VideoQA) datasets focus on causal reasoning within events, and few of them genuinely incorporate human ToM.
This paper presents BDIQA, the first benchmark to explore the cognitive reasoning capabilities of VideoQA models in the context of ToM.
- Score: 21.806678376095576
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: As a foundational component of cognitive intelligence, theory of mind (ToM)
can make AI more closely resemble human thought processes, thereby enhancing
their interaction and collaboration with humans. In particular, it can
significantly improve a model's comprehension of videos in complex scenes.
However, current video question answering (VideoQA) datasets focus on studying
causal reasoning within events, and few of them genuinely incorporate human ToM.
Consequently, there is a lack of development in ToM reasoning tasks within the
area of VideoQA. This paper presents BDIQA, the first benchmark to explore the
cognitive reasoning capabilities of VideoQA models in the context of ToM. BDIQA
is inspired by the cognitive development of children's ToM and addresses the
current deficiencies in machine ToM within datasets and tasks. Specifically, it
offers tasks at two difficulty levels, assessing Belief, Desire and Intention
(BDI) reasoning in both simple and complex scenarios. We conduct evaluations on
several mainstream VideoQA methods and diagnose their capabilities with
zero-shot, few-shot and supervised learning. We find that the performance of
pre-trained models on cognitive reasoning tasks remains unsatisfactory. To
counter this challenge, we undertake thorough analysis and experimentation,
ultimately presenting two guidelines to enhance cognitive reasoning derived
from ablation analysis.
Related papers
- Coding for Intelligence from the Perspective of Category [66.14012258680992]
Coding targets compressing and reconstructing data.
Recent trends demonstrate the potential homogeneity of coding and intelligence.
We propose a novel problem of Coding for Intelligence from the category theory view.
arXiv Detail & Related papers (2024-07-01T07:05:44Z)
- Evaluating Large Vision-and-Language Models on Children's Mathematical Olympiads [74.54183505245553]
A systematic analysis of AI capabilities for joint vision and text reasoning is missing in the current scientific literature.
We evaluate state-of-the-art LVLMs on their mathematical and algorithmic reasoning abilities using visuo-linguistic problems from children's Olympiads.
Our results show that modern LVLMs do demonstrate increasingly powerful reasoning skills in solving problems for higher grades, but lack the foundations to correctly answer problems designed for younger children.
arXiv Detail & Related papers (2024-06-22T05:04:39Z)
- What is the Visual Cognition Gap between Humans and Multimodal LLMs? [22.99627171182423]
Multimodal Large Language Models (MLLMs) have shown great promise in language-guided tasks such as recognition, segmentation, and object detection.
One open challenge is abstract visual reasoning (AVR), the cognitive ability to discern relationships among patterns in a set of images and extrapolate to predict subsequent patterns.
We propose a new dataset, MaRs-VQA, and a new benchmark, VCog-Bench, to evaluate the zero-shot capability of MLLMs.
arXiv Detail & Related papers (2024-06-14T22:02:21Z)
- Auxiliary task demands mask the capabilities of smaller language models [2.938889003635811]
We show that evaluation methods with greater task demands yield lower performance than evaluations with reduced demands.
Our results illustrate that LM performance should not be interpreted as a direct indication of intelligence.
arXiv Detail & Related papers (2024-04-03T02:56:52Z)
- OlaGPT: Empowering LLMs With Human-like Problem-Solving Abilities [19.83434949066066]
This paper introduces a novel intelligent framework, referred to as OlaGPT.
OlaGPT carefully studies a cognitive architecture framework and proposes to simulate certain aspects of human cognition.
The framework involves approximating different cognitive modules, including attention, memory, reasoning, learning, and corresponding scheduling and decision-making mechanisms.
arXiv Detail & Related papers (2023-05-23T09:36:51Z)
- A Review on Machine Theory of Mind [16.967933605635203]
Theory of Mind (ToM) is the ability to attribute mental states to others, the basis of human cognition.
In this paper, we review recent progress in machine ToM on beliefs, desires, and intentions.
arXiv Detail & Related papers (2023-03-21T04:58:47Z)
- Memory-Augmented Theory of Mind Network [59.9781556714202]
Social reasoning requires the capacity of theory of mind (ToM) to contextualise and attribute mental states to others.
Recent machine learning approaches to ToM have demonstrated that we can train the observer to read the past and present behaviours of other agents.
We tackle the challenges by equipping the observer with novel neural memory mechanisms to encode, and hierarchical attention to selectively retrieve information about others.
This results in ToMMY, a theory of mind model that learns to reason while making few assumptions about the underlying mental processes.
arXiv Detail & Related papers (2023-01-17T14:48:58Z)
- EgoTaskQA: Understanding Human Tasks in Egocentric Videos [89.9573084127155]
The EgoTaskQA benchmark provides a home for crucial dimensions of task understanding through question-answering on real-world egocentric videos.
We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others.
We evaluate state-of-the-art video reasoning models on our benchmark and show the significant gaps between them and humans in understanding complex goal-oriented egocentric videos.
arXiv Detail & Related papers (2022-10-08T05:49:05Z)
- HALMA: Humanlike Abstraction Learning Meets Affordance in Rapid Problem Solving [104.79156980475686]
Humans learn compositional and causal abstraction, i.e., knowledge, in response to the structure of naturalistic tasks.
We argue there shall be three levels of generalization in how an agent represents its knowledge: perceptual, conceptual, and algorithmic.
This benchmark is centered around a novel task domain, HALMA, for visual concept development and rapid problem-solving.
arXiv Detail & Related papers (2021-02-22T20:37:01Z)
- HySTER: A Hybrid Spatio-Temporal Event Reasoner [75.41988728376081]
We present HySTER, a Hybrid Spatio-Temporal Event Reasoner, to reason over physical events in videos.
We define a method based on general temporal, causal and physics rules which can be transferred across tasks.
This work sets the foundations for the incorporation of inductive logic programming in the field of VideoQA.
arXiv Detail & Related papers (2021-01-17T11:07:17Z)