Egocentric Video Task Translation
- URL: http://arxiv.org/abs/2212.06301v2
- Date: Thu, 6 Apr 2023 21:39:04 GMT
- Title: Egocentric Video Task Translation
- Authors: Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani
- Abstract summary: We propose EgoTask Translation (EgoT2), which takes a collection of models optimized on separate tasks and learns to translate their outputs for improved performance on any or all of them at once.
Unlike traditional transfer or multi-task learning, EgoT2's flipped design entails separate task-specific backbones and a task translator shared across all tasks, which captures synergies between even heterogeneous tasks and mitigates task competition.
- Score: 109.30649877677257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Different video understanding tasks are typically treated in isolation, and even with distinct types of curated data (e.g., classifying sports in one dataset, tracking animals in another). However, in wearable cameras, the immersive egocentric perspective of a person engaging with the world around them presents an interconnected web of video understanding tasks -- hand-object manipulations, navigation in the space, or human-human interactions -- that unfold continuously, driven by the person's goals. We argue that this calls for a much more unified approach. We propose EgoTask Translation (EgoT2), which takes a collection of models optimized on separate tasks and learns to translate their outputs for improved performance on any or all of them at once. Unlike traditional transfer or multi-task learning, EgoT2's flipped design entails separate task-specific backbones and a task translator shared across all tasks, which captures synergies between even heterogeneous tasks and mitigates task competition. Demonstrating our model on a wide array of video tasks from Ego4D, we show its advantages over existing transfer paradigms and achieve top-ranked results on four of the Ego4D 2022 benchmark challenges.
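The "flipped" design described in the abstract, frozen task-specific backbones feeding one translator shared across tasks, can be pictured concretely with a short sketch. The PyTorch snippet below is a minimal illustration only: the transformer-encoder translator, the per-task linear projections, the clip-level feature vectors, the dimensions, and all module names are assumptions made for clarity, not details taken from the paper or its released code.

```python
# Minimal sketch of an EgoT2-style "flipped" design: frozen task-specific
# backbones produce per-task features, and a single translator shared across
# tasks fuses them for the task of interest. All shapes and modules here are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class TaskTranslator(nn.Module):
    def __init__(self, backbones: nn.ModuleDict, feat_dims: dict,
                 d_model: int = 256, num_layers: int = 2, num_classes: int = 2):
        super().__init__()
        self.backbones = backbones
        for p in self.backbones.parameters():   # task-specific backbones stay frozen
            p.requires_grad = False
        # project heterogeneous task features into a shared token space
        self.proj = nn.ModuleDict({t: nn.Linear(d, d_model) for t, d in feat_dims.items()})
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.translator = nn.TransformerEncoder(layer, num_layers)  # shared across tasks
        self.head = nn.Linear(d_model, num_classes)                 # head for the primary task

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        tokens = []
        for name, backbone in self.backbones.items():
            with torch.no_grad():
                feat = backbone(video)                      # assumed (B, feat_dims[name])
            tokens.append(self.proj[name](feat).unsqueeze(1))
        fused = self.translator(torch.cat(tokens, dim=1))   # (B, num_tasks, d_model)
        return self.head(fused.mean(dim=1))                 # pooled prediction for the primary task
```

In this sketch only the projections, the translator, and the task head would be trained while the pretrained backbones stay fixed, which is what distinguishes the flipped design from multi-task learning with a single shared backbone.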
Related papers
- Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives [194.06650316685798]
Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities.
740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts.
The video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions.
arXiv Detail & Related papers (2023-11-30T05:21:07Z)
- Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving [85.62076860189116]
Video Task Decathlon (VTD) includes ten representative image and video tasks spanning classification, segmentation, localization, and association of objects and pixels.
We develop our unified network, VTDNet, that uses a single structure and a single set of weights for all ten tasks.
arXiv Detail & Related papers (2023-09-08T16:33:27Z)
- EgoTV: Egocentric Task Verification from Natural Language Task Descriptions [9.503477434050858]
We propose a benchmark and a synthetic dataset called Egocentric Task Verification (EgoTV).
The goal in EgoTV is to verify the execution of tasks from egocentric videos based on the natural language description of these tasks.
We propose a novel Neuro-Symbolic Grounding (NSG) approach that leverages symbolic representations to capture the compositional and temporal structure of tasks.
arXiv Detail & Related papers (2023-03-29T19:16:49Z)
- MINOTAUR: Multi-task Video Grounding From Multimodal Queries [70.08973664126873]
We present a single, unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark.
arXiv Detail & Related papers (2023-02-16T04:00:03Z)
- Egocentric Video Task Translation @ Ego4D Challenge 2022 [109.30649877677257]
The EgoTask Translation approach explores relations among a set of egocentric video tasks in the Ego4D challenge.
We propose to leverage existing models developed for other related tasks and design a task translator that learns to "translate" auxiliary task features to the primary task.
Our proposed approach achieves competitive performance on two Ego4D challenges, ranking 1st in the Talking to Me challenge and 3rd in the PNR localization challenge.
arXiv Detail & Related papers (2023-02-03T18:05:49Z)
- EgoTaskQA: Understanding Human Tasks in Egocentric Videos [89.9573084127155]
The EgoTaskQA benchmark provides a home for crucial dimensions of task understanding through question answering on real-world egocentric videos.
We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others.
We evaluate state-of-the-art video reasoning models on our benchmark and show significant gaps between them and humans in understanding complex goal-oriented egocentric videos.
arXiv Detail & Related papers (2022-10-08T05:49:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.