CRIPP-VQA: Counterfactual Reasoning about Implicit Physical Properties via Video Question Answering
- URL: http://arxiv.org/abs/2211.03779v1
- Date: Mon, 7 Nov 2022 18:55:26 GMT
- Title: CRIPP-VQA: Counterfactual Reasoning about Implicit Physical Properties via Video Question Answering
- Authors: Maitreya Patel, Tejas Gokhale, Chitta Baral, and Yezhou Yang
- Abstract summary: We introduce CRIPP-VQA, a new video question answering dataset for reasoning about the implicit physical properties of objects in a scene.
CRIPP-VQA contains videos of objects in motion, annotated with questions that involve counterfactual reasoning.
Our experiments reveal a surprising and significant performance gap between answering questions about implicit properties and about explicit properties.
- Score: 50.61988087577871
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Videos often capture objects, their visible properties, their motion, and the
interactions between different objects. Objects also have physical properties
such as mass, which the imaging pipeline is unable to directly capture.
However, these properties can be estimated by utilizing cues from relative
object motion and the dynamics introduced by collisions. In this paper, we
introduce CRIPP-VQA, a new video question answering dataset for reasoning about
the implicit physical properties of objects in a scene. CRIPP-VQA contains
videos of objects in motion, annotated with questions that involve
counterfactual reasoning about the effect of actions, questions about planning
in order to reach a goal, and descriptive questions about visible properties of
objects. The CRIPP-VQA test set enables evaluation under several out-of-distribution settings: videos containing objects whose masses, coefficients of friction, and initial velocities are not observed in the training distribution. Our experiments reveal a surprising and significant performance gap between answering questions about implicit properties (the focus of this paper) and explicit properties of objects (the focus of prior work).
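The estimation cue described in the abstract can be made concrete with a worked example. The following is a minimal sketch, not code from the paper: assuming a one-dimensional collision and velocities read off from tracked object motion, conservation of momentum recovers the mass ratio of the two objects. The function name and the toy velocity values are illustrative assumptions.

```python
# Minimal sketch (assumption, not from the CRIPP-VQA paper): recovering a
# relative mass from velocities observed around a 1D collision, using
# conservation of momentum: m1*u1 + m2*u2 = m1*v1 + m2*v2.

def mass_ratio_from_collision(u1: float, u2: float, v1: float, v2: float) -> float:
    """Return m1/m2 given pre-collision velocities (u1, u2) and
    post-collision velocities (v1, v2) of the two objects."""
    if u1 == v1:
        raise ValueError("object 1's velocity is unchanged; the ratio is undefined")
    return (v2 - u2) / (u1 - v1)

# Toy example (elastic collision, m1 = 2 * m2): object 1 slows from 3.0 to
# 1.0 m/s while object 2 accelerates from rest to 4.0 m/s.
print(mass_ratio_from_collision(3.0, 0.0, 1.0, 4.0))  # -> 2.0
```

In practice a model must infer such velocities from pixels rather than read them off directly, which is what makes the implicit-property questions in CRIPP-VQA difficult.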
Related papers
- Compositional Physical Reasoning of Objects and Events from Videos [122.6862357340911]
This paper addresses the challenge of inferring hidden physical properties from objects' motion and interactions.
We evaluate state-of-the-art video reasoning models on ComPhy and reveal their limited ability to capture these hidden properties.
We also propose a novel neuro-symbolic framework, Physical Concept Reasoner (PCR), that learns and reasons about both visible and hidden physical properties.
arXiv Detail & Related papers (2024-08-02T15:19:55Z)
- ComPhy: Compositional Physical Reasoning of Objects and Events from Videos [113.2646904729092]
The compositionality of visible and hidden properties poses unique challenges for AI models reasoning about the physical world.
Existing studies on video reasoning mainly focus on visually observable elements such as object appearance, movement, and contact interaction.
We propose an oracle neural-symbolic framework named Compositional Physics Learner (CPL), combining visual perception, physical property learning, dynamic prediction, and symbolic execution.
arXiv Detail & Related papers (2022-05-02T17:59:13Z)
- Video Action Detection: Analysing Limitations and Challenges [70.01260415234127]
We analyze existing datasets on video action detection and discuss their limitations.
We perform a bias study that analyzes a key property differentiating videos from static images: the temporal aspect.
These extreme experiments show the existence of biases that have crept into existing methods in spite of careful modeling.
arXiv Detail & Related papers (2022-04-17T00:42:14Z)
- Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering [27.979053252431306]
Video Question Answering (Video QA) is a powerful testbed to develop new AI capabilities.
We propose an object-oriented reasoning approach in which a video is abstracted as a dynamic stream of interacting objects.
This mechanism is materialized into a family of general-purpose neural units and their multi-level architecture.
arXiv Detail & Related papers (2021-06-25T05:12:42Z)
- Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning [84.90458333884443]
We present the Dynamic Concept Learner (DCL), a unified framework that grounds physical objects and events from video and language.
DCL can detect and associate objects across frames, ground visual properties and physical events, understand the causal relationships between events, make future and counterfactual predictions, and leverage these representations to answer queries.
DCL achieves state-of-the-art performance on CLEVRER, a challenging causal video reasoning dataset, even without using ground-truth attributes and collision labels from simulations for training.
arXiv Detail & Related papers (2021-03-30T17:59:48Z)
- Object Properties Inferring from and Transfer for Human Interaction Motions [51.896592493436984]
In this paper, we present a fine-grained action recognition method that learns to infer object properties from human interaction motion alone.
We collect a large number of videos and 3D skeletal motions of the performing actors using an inertial motion capture device.
In particular, we learn to identify the interacting object by estimating its weight, fragility, or delicacy.
arXiv Detail & Related papers (2020-08-20T14:36:34Z)
- Foldover Features for Dynamic Object Behavior Description in Microscopic Videos [4.194890536348037]
We propose foldover features to describe the behavior of dynamic objects in microscopic videos.
In our experiments, we evaluate the proposed foldover features on a sperm microscopic video dataset containing 1,374 sperms of three types.
arXiv Detail & Related papers (2020-03-19T08:39:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.