ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos
- URL: http://arxiv.org/abs/2311.01620v1
- Date: Thu, 2 Nov 2023 22:17:03 GMT
- Title: ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos
- Authors: Te-Lin Wu, Zi-Yi Dou, Qingyuan Hu, Yu Hou, Nischal Reddy Chandra, Marjorie Freedman, Ralph M. Weischedel, Nanyun Peng
- Abstract summary: ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints.
Each video is annotated with questions that span three distinct dimensions of reasoning: physical, social, and temporal.
We benchmark several state-of-the-art language-only and multimodal models on our dataset; experimental results demonstrate a significant performance gap.
- Score: 53.92440577914417
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal counterfactual reasoning is a vital yet challenging ability for AI
systems. It involves predicting the outcomes of hypothetical circumstances
based on vision and language inputs, which enables AI models to learn from
failures and explore hypothetical scenarios. Despite its importance, there are
only a few datasets targeting the counterfactual reasoning abilities of
multimodal models. Moreover, these datasets cover only reasoning over synthetic
environments or specific types of events (e.g., traffic collisions), making it
hard to reliably benchmark models' generalization ability across diverse
real-world scenarios and reasoning dimensions. To overcome these limitations,
we develop a video question answering dataset, ACQUIRED: it consists of 3.9K
annotated videos, encompassing a wide range of event types and incorporating
both first and third-person viewpoints, which ensures a focus on real-world
diversity. In addition, each video is annotated with questions that span three
distinct dimensions of reasoning: physical, social, and temporal. This enables
a comprehensive evaluation of models' counterfactual abilities along multiple
aspects. We benchmark several state-of-the-art language-only and multimodal
models on our dataset, and experimental results demonstrate a significant
performance gap (>13%) between models and humans. The findings
suggest that multimodal counterfactual reasoning remains an open challenge and
ACQUIRED is a comprehensive and reliable benchmark for inspiring future
research in this direction.
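To make the benchmark setup concrete, below is a minimal sketch of how an ACQUIRED-style multiple-choice example might be represented and scored. The schema (field names such as `video_id`, `dimension`, `choices`, `answer_idx`) and the baseline are hypothetical illustrations, not the dataset's actual format; the reported >13% gap is simply human accuracy minus model accuracy on the same questions.

```python
from collections import defaultdict

# Hypothetical schema for one ACQUIRED-style example; the real dataset's
# field names and file layout may differ.
EXAMPLE = {
    "video_id": "v_00042",
    "viewpoint": "first_person",   # or "third_person"
    "dimension": "physical",       # "physical" | "social" | "temporal"
    "question": "What would have happened if the cup had not been held?",
    "choices": ["It would have fallen", "It would have stayed in place"],
    "answer_idx": 0,
}

def evaluate(examples, predict):
    """Compute overall and per-dimension accuracy, where
    predict(example) returns a choice index."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        dim = ex["dimension"]
        total[dim] += 1
        if predict(ex) == ex["answer_idx"]:
            correct[dim] += 1
    acc = {d: correct[d] / total[d] for d in total}
    acc["overall"] = sum(correct.values()) / sum(total.values())
    return acc

if __name__ == "__main__":
    # Toy run with a trivial baseline that always picks the first choice.
    model_acc = evaluate([EXAMPLE], lambda ex: 0)["overall"]
    human_acc = 0.90  # placeholder value; the paper reports a >13% model-human gap
    print(f"model: {model_acc:.1%}, gap to humans: {human_acc - model_acc:.1%}")
```

Evaluating per dimension as well as overall mirrors the paper's framing, since a model may do well on physical counterfactuals yet lag on social or temporal ones.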
Related papers
- Eureka: Evaluating and Understanding Large Foundation Models [23.020996995362104]
We present Eureka, an open-source framework for standardizing evaluations of large foundation models beyond single-score reporting and rankings.
We conduct an analysis of 12 state-of-the-art models, providing in-depth insights into failure understanding and model comparison.
arXiv Detail & Related papers (2024-09-13T18:01:49Z)
- HEMM: Holistic Evaluation of Multimodal Foundation Models [91.60364024897653]
Multimodal foundation models can holistically process text alongside images, video, audio, and other sensory modalities.
It is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains.
arXiv Detail & Related papers (2024-07-03T18:00:48Z)
- WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning [49.72868038180909]
We present WorldQA, a video dataset designed to push the boundaries of multimodal world models.
We identify five essential types of world knowledge for question formulation.
We introduce WorldRetriever, an agent designed to synthesize expert knowledge into a coherent reasoning chain.
arXiv Detail & Related papers (2024-05-06T08:42:34Z)
- Grounded Question-Answering in Long Egocentric Videos [39.281013854331285]
Open-ended question-answering (QA) in long, egocentric videos allows individuals or robots to inquire about their own past visual experiences.
This task presents unique challenges, including the complexity of temporally grounding queries within extensive video content.
Our proposed approach tackles these challenges by (i) integrating query grounding and answering within a unified model to reduce error propagation.
arXiv Detail & Related papers (2023-12-11T16:31:55Z)
- What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models [22.0839948292609]
We introduce a novel dataset, C-VQA, specifically designed to test the counterfactual reasoning capabilities of modern language models.
This dataset is constructed by infusing original questions with various types of counterfactual presuppositions, such as numerical and counter-language queries.
Our evaluations of contemporary vision models using this dataset have revealed substantial performance drops, with some models showing up to a 40% decrease.
arXiv Detail & Related papers (2023-10-10T13:45:59Z)
- Causal Triplet: An Open Challenge for Intervention-centric Causal Representation Learning [98.78136504619539]
Causal Triplet is a causal representation learning benchmark featuring visually more complex scenes.
We show that models built with the knowledge of disentangled or object-centric representations significantly outperform their distributed counterparts.
arXiv Detail & Related papers (2023-01-12T17:43:38Z)
- JECC: Commonsense Reasoning Tasks Derived from Interactive Fictions [75.42526766746515]
We propose a new commonsense reasoning dataset based on human Interactive Fiction (IF) gameplay walkthroughs.
Our dataset focuses on the assessment of functional commonsense knowledge rules rather than factual knowledge.
Experiments show that the introduced dataset is challenging for previous machine reading models as well as recent large language models.
arXiv Detail & Related papers (2022-10-18T19:20:53Z)
- Exploring the Trade-off between Plausibility, Change Intensity and Adversarial Power in Counterfactual Explanations using Multi-objective Optimization [73.89239820192894]
We argue that automated counterfactual generation should regard several aspects of the produced adversarial instances.
We present a novel framework for the generation of counterfactual examples.
arXiv Detail & Related papers (2022-05-20T15:02:53Z)
- Future Frame Prediction of a Video Sequence [5.660207256468971]
The ability to predict, anticipate and reason about future events is the essence of intelligence.
arXiv Detail & Related papers (2020-08-31T15:31:02Z)