PACS: A Dataset for Physical Audiovisual CommonSense Reasoning
- URL: http://arxiv.org/abs/2203.11130v1
- Date: Mon, 21 Mar 2022 17:05:23 GMT
- Title: PACS: A Dataset for Physical Audiovisual CommonSense Reasoning
- Authors: Samuel Yu, Peter Wu, Paul Pu Liang, Ruslan Salakhutdinov,
Louis-Philippe Morency
- Abstract summary: This paper contributes PACS: the first audiovisual benchmark annotated for physical commonsense attributes.
PACS contains a total of 13,400 question-answer pairs, involving 1,377 unique physical commonsense questions and 1,526 videos.
Using PACS, we evaluate multiple state-of-the-art models on this new challenging task.
- Score: 119.0100966278682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In order for AI systems to be safely deployed in real-world scenarios such as
hospitals, schools, and the workplace, they should be able to reason about the
physical world by understanding the physical properties and affordances of
available objects, how they can be manipulated, and how they interact with
other physical objects. Physical commonsense reasoning is fundamentally a
multi-sensory task, since physical properties are manifested
through multiple modalities, two of them being vision and acoustics. Our paper
takes a step towards real-world physical commonsense reasoning by contributing
PACS: the first audiovisual benchmark annotated for physical commonsense
attributes. PACS contains a total of 13,400 question-answer pairs, involving
1,377 unique physical commonsense questions and 1,526 videos. Our dataset
provides new opportunities to advance the research field of physical reasoning
by bringing audio as a core component of this multimodal problem. Using PACS,
we evaluate multiple state-of-the-art models on this new challenging task.
While some models show promising results (70% accuracy), they all fall short of
human performance (95% accuracy). We conclude the paper by demonstrating the
importance of multimodal reasoning and providing possible avenues for future
research.
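
A minimal sketch of what a PACS-style example and the reported accuracy metric could look like, assuming the paper's binary object-comparison setup (a question about two object videos with audio, where the model picks the object that fits). The field names and layout below are illustrative assumptions, not the released data format:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PACSExample:
    """Hypothetical schema: one physical-commonsense question about two objects."""
    question: str   # e.g., "Which object would make a duller sound when struck?"
    video_a: str    # path to a video (with audio track) showing object A
    video_b: str    # path to a video (with audio track) showing object B
    label: int      # 0 = object A, 1 = object B

def accuracy(predict: Callable[[str, str, str], int],
             examples: List[PACSExample]) -> float:
    """predict(question, video_a, video_b) returns 0 or 1; returns fraction correct."""
    correct = sum(predict(ex.question, ex.video_a, ex.video_b) == ex.label
                  for ex in examples)
    return correct / len(examples)
```

Under this kind of metric, the paper reports roughly 70% accuracy for the best evaluated model versus roughly 95% for humans.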
Related papers
- Compositional Physical Reasoning of Objects and Events from Videos [122.6862357340911]
This paper addresses the challenge of inferring hidden physical properties from objects' motion and interactions.
We evaluate state-of-the-art video reasoning models on ComPhy and reveal their limited ability to capture these hidden properties.
We also propose a novel neuro-symbolic framework, Physical Concept Reasoner (PCR), that learns and reasons about both visible and hidden physical properties.
arXiv Detail & Related papers (2024-08-02T15:19:55Z) - ContPhy: Continuum Physical Concept Learning and Reasoning from Videos [86.63174804149216]
ContPhy is a novel benchmark for assessing machine physical commonsense.
We evaluate a range of AI models and find that they still struggle to achieve satisfactory performance on ContPhy.
We also introduce an oracle model (ContPRO) that marries particle-based physical dynamics models with recent large language models.
arXiv Detail & Related papers (2024-02-09T01:09:21Z)
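
The summary only names ContPRO's two ingredients, so the following is a speculative sketch of how a particle-based simulator and an LLM might be composed. `simulator`, `llm`, and every method on them are hypothetical stand-ins, not the authors' code:

```python
# Speculative sketch of ContPRO's stated recipe: pair a particle-based physics
# simulator with a large language model. All interfaces here are hypothetical.

def contpro_style_answer(video_frames, question, simulator, llm):
    # 1. Fit a particle representation of the observed continuum (fluid, cloth, rope).
    particles = simulator.infer_particles(video_frames)

    # 2. Roll the particle dynamics forward to predict what happens next.
    rollout = simulator.rollout(particles, steps=50)

    # 3. Verbalize the simulated outcome so the language model can reason over it.
    facts = simulator.describe(rollout)  # e.g., "the liquid overflows the left cup"

    # 4. Ask the LLM the physical-commonsense question, grounded in those facts.
    prompt = f"Simulated facts: {facts}\nQuestion: {question}\nAnswer:"
    return llm(prompt)
```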
arXiv Detail & Related papers (2024-02-09T01:09:21Z) - ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life
Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints.
Each video is annotated with questions spanning three distinct dimensions of reasoning: physical, social, and temporal.
We benchmark several state-of-the-art language-only and multimodal models on our dataset; experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z) - ComPhy: Compositional Physical Reasoning of Objects and Events from
Videos [113.2646904729092]
The compositionality between visible and hidden properties poses unique challenges for AI models to reason about the physical world.
Existing studies on video reasoning mainly focus on visually observable elements such as object appearance, movement, and contact interaction.
We propose an oracle neural-symbolic framework named Compositional Physics Learner (CPL), combining visual perception, physical property learning, dynamic prediction, and symbolic execution.
arXiv Detail & Related papers (2022-05-02T17:59:13Z)
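
The four named components suggest a staged pipeline. The sketch below shows one plausible composition; every interface is hypothetical and serves only to make the data flow between stages explicit:

```python
# Schematic sketch of the four stages the summary names; not the authors' API.

def cpl_style_pipeline(video, question, perception, property_learner, dynamics, executor):
    objects = perception.detect_and_track(video)           # visual perception
    hidden = property_learner.infer(objects)               # mass, charge, and other hidden properties
    future = dynamics.predict(objects, hidden)             # dynamic prediction
    program = executor.parse(question)                     # question -> symbolic program
    return executor.run(program, objects, hidden, future)  # symbolic execution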
arXiv Detail & Related papers (2022-05-02T17:59:13Z) - Video Sentiment Analysis with Bimodal Information-augmented Multi-Head
Attention [7.997124140597719]
This study focuses on sentiment analysis of videos containing time-series data from multiple modalities.
The key problem is how to fuse such heterogeneous data.
Based on bimodal interactions, more important bimodal features are assigned larger weights.
arXiv Detail & Related papers (2021-03-03T12:30:11Z)
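
The weighting idea lends itself to a small attention-style module: score each bimodal interaction and upweight the informative pairs. This PyTorch sketch illustrates the general mechanism only; it is not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class BimodalWeightedFusion(nn.Module):
    """Toy version of the idea: larger weights for more important bimodal pairs."""
    def __init__(self, dim: int):
        super().__init__()
        self.pair_proj = nn.Linear(2 * dim, dim)  # embed one bimodal pair
        self.scorer = nn.Linear(dim, 1)           # importance score per pair

    def forward(self, text, audio, visual):
        # text/audio/visual: per-clip feature vectors of shape (batch, dim).
        # Build the three bimodal interaction features: (T,A), (T,V), (A,V).
        pairs = torch.stack([
            self.pair_proj(torch.cat([text, audio], dim=-1)),
            self.pair_proj(torch.cat([text, visual], dim=-1)),
            self.pair_proj(torch.cat([audio, visual], dim=-1)),
        ], dim=1)                                           # (batch, 3, dim)
        weights = torch.softmax(self.scorer(pairs), dim=1)  # (batch, 3, 1)
        return (weights * pairs).sum(dim=1)                 # weighted fusion -> (batch, dim)
```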
arXiv Detail & Related papers (2021-03-03T12:30:11Z) - ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation [75.0278287071591]
ThreeDWorld (TDW) is a platform for interactive multi-modal physical simulation.
TDW enables simulation of high-fidelity sensory data and physical interactions between mobile agents and objects in rich 3D environments.
We present initial experiments enabled by TDW in emerging research directions in computer vision, machine learning, and cognitive science.
arXiv Detail & Related papers (2020-07-09T17:33:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.