EgoTaskQA: Understanding Human Tasks in Egocentric Videos
- URL: http://arxiv.org/abs/2210.03929v1
- Date: Sat, 8 Oct 2022 05:49:05 GMT
- Title: EgoTaskQA: Understanding Human Tasks in Egocentric Videos
- Authors: Baoxiong Jia, Ting Lei, Song-Chun Zhu, Siyuan Huang
- Abstract summary: The EgoTaskQA benchmark provides a single home for the crucial dimensions of task understanding through question-answering on real-world egocentric videos.
We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others.
We evaluate state-of-the-art video reasoning models on our benchmark and show the significant gap between them and humans in understanding complex goal-oriented egocentric videos.
- Score: 89.9573084127155
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Understanding human tasks through video observations is an essential
capability of intelligent agents. The challenges of such capability lie in the
difficulty of generating a detailed understanding of situated actions, their
effects on object states (i.e., state changes), and their causal dependencies.
These challenges are further aggravated by the natural parallelism from
multi-tasking and partial observations in multi-agent collaboration. Most prior
works leverage action localization or future prediction as an indirect metric
for evaluating such task understanding from videos. To make a direct
evaluation, we introduce the EgoTaskQA benchmark that provides a single home
for the crucial dimensions of task understanding through question-answering on
real-world egocentric videos. We meticulously design questions that target the
understanding of (1) action dependencies and effects, (2) intents and goals,
and (3) agents' beliefs about others. These questions are divided into four
types, including descriptive (what status?), predictive (what will?),
explanatory (what caused?), and counterfactual (what if?) to provide diagnostic
analyses on spatial, temporal, and causal understandings of goal-oriented
tasks. We evaluate state-of-the-art video reasoning models on our benchmark and
show the significant gap between them and humans in understanding complex
goal-oriented egocentric videos. We hope this effort will drive the vision
community to move onward with goal-oriented video understanding and reasoning.
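To make the question taxonomy above concrete, the sketch below shows one hypothetical way to organize benchmark-style samples by question type and compute a per-type accuracy breakdown. All class, field, and function names are illustrative assumptions, not EgoTaskQA's actual data format or evaluation code.

```python
# Hypothetical sketch of an EgoTaskQA-style sample layout and a per-type
# accuracy breakdown. Names and fields are illustrative assumptions only,
# not the benchmark's actual schema or official evaluation script.
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class QuestionType(Enum):
    DESCRIPTIVE = "what status?"
    PREDICTIVE = "what will?"
    EXPLANATORY = "what caused?"
    COUNTERFACTUAL = "what if?"


@dataclass
class QASample:
    video_id: str          # identifier of the egocentric video clip
    question: str
    answer: str
    q_type: QuestionType   # one of the four diagnostic question types
    target: str            # e.g. "action dependency", "intent/goal", "belief"


def per_type_accuracy(samples, predictions):
    """Exact-match accuracy grouped by question type for diagnostic analysis."""
    correct, total = defaultdict(int), defaultdict(int)
    for sample, pred in zip(samples, predictions):
        total[sample.q_type] += 1
        correct[sample.q_type] += int(
            pred.strip().lower() == sample.answer.strip().lower()
        )
    return {t: correct[t] / total[t] for t in total}
```

Grouping accuracy by question type in this way mirrors the diagnostic split described in the abstract, separating descriptive and predictive performance from explanatory and counterfactual performance.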
Related papers
- STAR: A Benchmark for Situated Reasoning in Real-World Videos [94.78038233351758]
This paper introduces a new benchmark that evaluates situated reasoning ability via situation abstraction and logic-grounded question answering on real-world videos.
The dataset includes four types of questions, including interaction, sequence, prediction, and feasibility.
We propose a diagnostic neuro-symbolic model that can disentangle visual perception, situation abstraction, language understanding, and functional reasoning.
arXiv Detail & Related papers (2024-05-15T21:53:54Z)
- A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives [5.515192437680944]
We seek a unified approach to video understanding that combines shared temporal modelling of human actions with minimal overhead.
We propose EgoPack, a solution that creates a collection of task perspectives that can be carried across downstream tasks and used as a potential source of additional insights.
We demonstrate the effectiveness and efficiency of our approach on four Ego4D benchmarks, outperforming current state-of-the-art methods.
arXiv Detail & Related papers (2024-03-05T15:18:02Z)
- BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind [21.806678376095576]
Theory of mind (ToM) can make AI more closely resemble human thought processes.
Video question answering (VideoQA) datasets focus on studying causal reasoning within events, but few of them genuinely incorporate human ToM.
This paper presents BDIQA, the first benchmark to explore the cognitive reasoning capabilities of VideoQA models in the context of ToM.
arXiv Detail & Related papers (2024-02-12T04:34:19Z)
- EgoTV: Egocentric Task Verification from Natural Language Task Descriptions [9.503477434050858]
We propose a benchmark and a synthetic dataset called Egocentric Task Verification (EgoTV).
The goal in EgoTV is to verify the execution of tasks from egocentric videos based on the natural language description of these tasks.
We propose a novel Neuro-Symbolic Grounding (NSG) approach that leverages symbolic representations to capture the compositional and temporal structure of tasks.
arXiv Detail & Related papers (2023-03-29T19:16:49Z)
- MINOTAUR: Multi-task Video Grounding From Multimodal Queries [70.08973664126873]
We present a single, unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark.
arXiv Detail & Related papers (2023-02-16T04:00:03Z)
- Egocentric Video Task Translation [109.30649877677257]
We propose EgoTask Translation (EgoT2), which takes a collection of models optimized on separate tasks and learns to translate their outputs for improved performance on any or all of them at once.
Unlike traditional transfer or multi-task learning, EgoT2's flipped design entails separate task-specific backbones and a task translator shared across all tasks, which captures synergies between even heterogeneous tasks and mitigates task competition.
arXiv Detail & Related papers (2022-12-13T00:47:13Z)
- Episodic Memory Question Answering [55.83870351196461]
We envision a scenario wherein the human communicates with an AI agent powering an augmented reality device by asking questions.
In order to succeed, the egocentric AI assistant must construct semantically rich and efficient scene memories.
We introduce a new task - Episodic Memory Question Answering (EMQA).
We show that our choice of episodic scene memory outperforms naive, off-the-shelf solutions for the task as well as a host of very competitive baselines.
arXiv Detail & Related papers (2022-05-03T17:28:43Z)
- HALMA: Humanlike Abstraction Learning Meets Affordance in Rapid Problem Solving [104.79156980475686]
Humans learn compositional and causal abstraction, i.e., knowledge, in response to the structure of naturalistic tasks.
We argue there shall be three levels of generalization in how an agent represents its knowledge: perceptual, conceptual, and algorithmic.
This benchmark is centered around a novel task domain, HALMA, for visual concept development and rapid problem-solving.
arXiv Detail & Related papers (2021-02-22T20:37:01Z)
- DramaQA: Character-Centered Video Story Understanding with Hierarchical QA [24.910132013543947]
We propose a novel video question answering (Video QA) task, DramaQA, for a comprehensive understanding of the video story.
Our dataset is built upon the TV drama "Another Miss Oh" and contains 17,983 QA pairs from 23,928 video clips of varying length.
We provide 217,308 annotated images with rich character-centered annotations, including visual bounding boxes, behaviors and emotions of main characters.
arXiv Detail & Related papers (2020-05-07T09:44:58Z)