Related papers: BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind

URL: http://arxiv.org/abs/2402.07402v1
Date: Mon, 12 Feb 2024 04:34:19 GMT
Title: BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind
Authors: Yuanyuan Mao, Xin Lin, Qin Ni, Liang He
Abstract summary: Theory of mind (ToM) can make AI more closely resemble human thought processes. Video question answering (VideoQA) datasets focus on studying causal reasoning within events, and few of them genuinely incorporate human ToM. This paper presents BDIQA, the first benchmark to explore the cognitive reasoning capabilities of VideoQA models in the context of ToM.
Score: 21.806678376095576
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: As a foundational component of cognitive intelligence, theory of mind (ToM) can make AI more closely resemble human thought processes, thereby enhancing its interaction and collaboration with humans. In particular, it can significantly improve a model's comprehension of videos in complex scenes. However, current video question answering (VideoQA) datasets focus on studying causal reasoning within events, and few of them genuinely incorporate human ToM. Consequently, ToM reasoning tasks remain underdeveloped within the area of VideoQA. This paper presents BDIQA, the first benchmark to explore the cognitive reasoning capabilities of VideoQA models in the context of ToM. BDIQA is inspired by the cognitive development of children's ToM and addresses the current deficiencies in machine ToM within datasets and tasks. Specifically, it offers tasks at two difficulty levels, assessing Belief, Desire and Intention (BDI) reasoning in both simple and complex scenarios. We conduct evaluations on several mainstream VideoQA methods and diagnose their capabilities with zero-shot, few-shot and supervised learning. We find that the performance of pre-trained models on cognitive reasoning tasks remains unsatisfactory. To counter this challenge, we undertake thorough analysis and experimentation, ultimately presenting two guidelines, derived from ablation analysis, for enhancing cognitive reasoning.

Related papers

Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [58.86928947970342]
Embodied-R is a framework combining large-scale Vision-Language Models for perception and small-scale Language Models for reasoning. After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models. Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration.
arXiv Detail & Related papers (2025-04-17T06:16:11Z)
STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks. However, Video-LLMs struggle with compositional reasoning that requires multi-step explicit-temporal inference across object relations, interactions and events. We propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich finetuning data from any raw videos to improve themselves.
arXiv Detail & Related papers (2024-11-29T11:54:55Z)
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level [63.18855743293851]
Motion-Grounded Video Reasoning is a new motion understanding task that requires visual answers (video segmentation masks) according to the input question. This task extends existing grounding work on explicit action/motion grounding to a more general format by enabling implicit reasoning via questions. We introduce a novel baseline model named Motion-Grounded Video Reasoning Assistant (MORA).
arXiv Detail & Related papers (2024-11-15T03:45:09Z)
VCBench: A Controllable Benchmark for Symbolic and Abstract Challenges in Video Cognition [19.215440092652507]
We introduce VCBench, a controllable benchmark to assess cognitive abilities involving symbolic and abstract concepts. By generating video data with the Python-based engine, VCBench allows for precise control over the video content. Our evaluation reveals that even state-of-the-art (SOTA) models, such as Qwen2-VL-72B, struggle with simple video cognition tasks involving abstract concepts.
arXiv Detail & Related papers (2024-11-14T00:26:26Z)
Coding for Intelligence from the Perspective of Category [66.14012258680992]
Recent trends demonstrate the potential homogeneity of two fields: coding, which targets compressing and reconstructing data, and intelligence. We propose a novel problem of Coding for Intelligence from the category theory view.
arXiv Detail & Related papers (2024-07-01T07:05:44Z)
Evaluating Large Vision-and-Language Models on Children's Mathematical Olympiads [74.54183505245553]
A systematic analysis of AI capabilities for joint vision and text reasoning is missing in the current scientific literature. We evaluate state-of-the-art LVLMs on their mathematical and algorithmic reasoning abilities using visuo-linguistic problems from children's Olympiads. Our results show that modern LVLMs do demonstrate increasingly powerful reasoning skills in solving problems for higher grades, but lack the foundations to correctly answer problems designed for younger children.
arXiv Detail & Related papers (2024-06-22T05:04:39Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark MR-Ben that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
What is the Visual Cognition Gap between Humans and Multimodal LLMs? [22.99627171182423]
Multimodal Large Language Models (MLLMs) have shown great promise in language-guided tasks such as recognition, segmentation, and object detection. One challenge that remains is abstract visual reasoning (AVR), the cognitive ability to discern relationships among patterns in a set of images and extrapolate to predict subsequent patterns. We propose a new dataset, MaRs-VQA, and a new benchmark, VCog-Bench, to evaluate the zero-shot capability of MLLMs.
arXiv Detail & Related papers (2024-06-14T22:02:21Z)
OlaGPT: Empowering LLMs With Human-like Problem-Solving Abilities [19.83434949066066]
This paper introduces a novel intelligent framework, referred to as OlaGPT. OlaGPT builds on a careful study of a cognitive architecture framework and proposes to simulate certain aspects of human cognition. The framework involves approximating different cognitive modules, including attention, memory, reasoning, learning, and corresponding scheduling and decision-making mechanisms.
arXiv Detail & Related papers (2023-05-23T09:36:51Z)
A Review on Machine Theory of Mind [16.967933605635203]
Theory of Mind (ToM) is the ability to attribute mental states to others, the basis of human cognition. In this paper, we review recent progress in machine ToM on beliefs, desires, and intentions.
arXiv Detail & Related papers (2023-03-21T04:58:47Z)
EgoTaskQA: Understanding Human Tasks in Egocentric Videos [89.9573084127155]
The EgoTaskQA benchmark provides a home for crucial dimensions of task understanding through question-answering on real-world egocentric videos. We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others. We evaluate state-of-the-art video reasoning models on our benchmark and show significant gaps between them and humans in understanding complex goal-oriented egocentric videos.
arXiv Detail & Related papers (2022-10-08T05:49:05Z)
HALMA: Humanlike Abstraction Learning Meets Affordance in Rapid Problem Solving [104.79156980475686]
Humans learn compositional and causal abstraction, i.e., knowledge, in response to the structure of naturalistic tasks. We argue there should be three levels of generalization in how an agent represents its knowledge: perceptual, conceptual, and algorithmic. This benchmark is centered around a novel task domain, HALMA, for visual concept development and rapid problem-solving.
arXiv Detail & Related papers (2021-02-22T20:37:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.