BDIQA: A New Dataset for Video Question Answering to Explore Cognitive
Reasoning through Theory of Mind
- URL: http://arxiv.org/abs/2402.07402v1
- Date: Mon, 12 Feb 2024 04:34:19 GMT
- Title: BDIQA: A New Dataset for Video Question Answering to Explore Cognitive
Reasoning through Theory of Mind
- Authors: Yuanyuan Mao, Xin Lin, Qin Ni, Liang He
- Abstract summary: Theory of mind (ToM) can make AI more closely resemble human thought processes.
Video question answering (VideoQA) datasets focus on studying causal reasoning within events, but few of them genuinely incorporate human ToM.
This paper presents BDIQA, the first benchmark to explore the cognitive reasoning capabilities of VideoQA models in the context of ToM.
- Score: 21.806678376095576
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: As a foundational component of cognitive intelligence, theory of mind (ToM)
can make AI more closely resemble human thought processes, thereby enhancing
its interaction and collaboration with humans. In particular, it can
significantly improve a model's comprehension of videos in complex scenes.
However, current video question answering (VideoQA) datasets focus on studying
causal reasoning within events, and few of them genuinely incorporate human ToM.
Consequently, ToM reasoning tasks remain underdeveloped in the area of VideoQA.
This paper presents BDIQA, the first benchmark to explore the
cognitive reasoning capabilities of VideoQA models in the context of ToM. BDIQA
is inspired by the cognitive development of children's ToM and addresses the
current deficiencies of machine ToM in datasets and tasks. Specifically, it
offers tasks at two difficulty levels, assessing Belief, Desire and Intention
(BDI) reasoning in both simple and complex scenarios. We evaluate
several mainstream VideoQA methods and diagnose their capabilities under
zero-shot, few-shot, and supervised learning. We find that the performance of
pre-trained models on cognitive reasoning tasks remains unsatisfactory. To
counter this challenge, we undertake thorough analysis and experimentation,
and ultimately present two guidelines, derived from ablation analysis, for
enhancing cognitive reasoning.
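The abstract describes the evaluation protocol only at a high level: BDI questions posed over video clips, scored under zero-shot, few-shot, and supervised settings. Below is a minimal sketch of how such an evaluation loop might be organized; the sample fields, class names, and the answer_question stub are hypothetical illustrations, not the authors' released data format or code.
```python
# Minimal sketch of a BDI-style VideoQA evaluation loop.
# All field names and the model stub are assumptions for illustration only.
from dataclasses import dataclass
from typing import List


@dataclass
class BDIQASample:
    video_id: str       # identifier of the video clip (hypothetical field)
    question: str       # e.g. "Where does the agent believe the ball is?"
    options: List[str]  # candidate answers (multiple choice)
    answer_idx: int     # index of the ground-truth option
    bdi_type: str       # "belief", "desire", or "intention"
    difficulty: str     # "simple" or "complex" scenario


def answer_question(sample: BDIQASample, shots: List[BDIQASample]) -> int:
    """Placeholder for a VideoQA model call.

    `shots` is empty for zero-shot evaluation and holds a few solved examples
    for few-shot prompting; a real harness would feed video frames plus the
    question to a pretrained model and return the predicted option index.
    """
    return 0  # trivially guess the first option


def accuracy(samples: List[BDIQASample], shots: List[BDIQASample]) -> float:
    """Fraction of samples whose predicted option matches the ground truth."""
    correct = sum(answer_question(s, shots) == s.answer_idx for s in samples)
    return correct / max(len(samples), 1)


if __name__ == "__main__":
    demo = [
        BDIQASample("vid_001", "Where does the child believe the toy is?",
                    ["in the box", "under the table"], 0, "belief", "simple"),
        BDIQASample("vid_002", "What does the adult intend to do next?",
                    ["open the drawer", "leave the room"], 1, "intention", "complex"),
    ]
    print("zero-shot accuracy:", accuracy(demo, shots=[]))
    print("few-shot accuracy:", accuracy(demo, shots=demo[:1]))
```
A real harness would replace the stub with a pretrained VideoQA model; reporting accuracy separately per bdi_type and difficulty would mirror the two-level task design described above.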
Related papers
- Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level [63.18855743293851]
Motion-Grounded Video Reasoning is a new motion understanding task that requires generating visual answers (video segmentation masks) in response to the input question.
This task extends existing grounding work on explicit action/motion grounding to a more general format by enabling implicit reasoning via questions.
We introduce a novel baseline model named Motion-Grounded Video Reasoning Assistant (MORA).
arXiv Detail & Related papers (2024-11-15T03:45:09Z)
- VCBench: A Controllable Benchmark for Symbolic and Abstract Challenges in Video Cognition [19.215440092652507]
We introduce VCBench, a controllable benchmark to assess cognitive abilities involving symbolic and abstract concepts.
By generating video data with a Python-based engine, VCBench allows for precise control over the video content.
Our evaluation reveals that even state-of-the-art (SOTA) models, such as Qwen2-VL-72B, struggle with simple video cognition tasks involving abstract concepts.
arXiv Detail & Related papers (2024-11-14T00:26:26Z)
- Coding for Intelligence from the Perspective of Category [66.14012258680992]
Coding, which targets compressing and reconstructing data, and intelligence have traditionally been treated as two distinct fields.
Recent trends demonstrate the potential homogeneity of these two fields.
We propose a novel problem of Coding for Intelligence from the category theory view.
arXiv Detail & Related papers (2024-07-01T07:05:44Z)
- Evaluating Large Vision-and-Language Models on Children's Mathematical Olympiads [74.54183505245553]
A systematic analysis of AI capabilities for joint vision and text reasoning is missing in the current scientific literature.
We evaluate state-of-the-art LVLMs on their mathematical and algorithmic reasoning abilities using visuo-linguistic problems from children's Olympiads.
Our results show that modern LVLMs do demonstrate increasingly powerful reasoning skills in solving problems for higher grades, but lack the foundations to correctly answer problems designed for younger children.
arXiv Detail & Related papers (2024-06-22T05:04:39Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present MR-Ben, a process-based benchmark that demands meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- What is the Visual Cognition Gap between Humans and Multimodal LLMs? [22.99627171182423]
Multimodal Large Language Models (MLLMs) have shown great promise in language-guided tasks such as recognition, segmentation, and object detection, yet certain cognitive challenges remain difficult for them.
One such challenge is abstract visual reasoning (AVR) -- the cognitive ability to discern relationships among patterns in a set of images and extrapolate to predict subsequent patterns.
We propose a new dataset, MaRs-VQA, and a new benchmark, VCog-Bench, to evaluate the zero-shot capability of MLLMs.
arXiv Detail & Related papers (2024-06-14T22:02:21Z)
- OlaGPT: Empowering LLMs With Human-like Problem-Solving Abilities [19.83434949066066]
This paper introduces a novel intelligent framework, referred to as OlaGPT.
OlaGPT carefully studies a cognitive architecture framework and proposes to simulate certain aspects of human cognition.
The framework involves approximating different cognitive modules, including attention, memory, reasoning, learning, and corresponding scheduling and decision-making mechanisms.
arXiv Detail & Related papers (2023-05-23T09:36:51Z)
- A Review on Machine Theory of Mind [16.967933605635203]
Theory of Mind (ToM) is the ability to attribute mental states to others, the basis of human cognition.
In this paper, we review recent progress in machine ToM on beliefs, desires, and intentions.
arXiv Detail & Related papers (2023-03-21T04:58:47Z)
- EgoTaskQA: Understanding Human Tasks in Egocentric Videos [89.9573084127155]
The EgoTaskQA benchmark provides a home for crucial dimensions of task understanding through question-answering on real-world egocentric videos.
We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others.
We evaluate state-of-the-art video reasoning models on our benchmark and show significant gaps between these models and humans in understanding complex goal-oriented egocentric videos.
arXiv Detail & Related papers (2022-10-08T05:49:05Z)
- HALMA: Humanlike Abstraction Learning Meets Affordance in Rapid Problem Solving [104.79156980475686]
Humans learn compositional and causal abstraction, i.e., knowledge, in response to the structure of naturalistic tasks.
We argue there shall be three levels of generalization in how an agent represents its knowledge: perceptual, conceptual, and algorithmic.
This benchmark is centered around a novel task domain, HALMA, for visual concept development and rapid problem-solving.
arXiv Detail & Related papers (2021-02-22T20:37:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences of its use.