EgoTV: Egocentric Task Verification from Natural Language Task Descriptions
- URL: http://arxiv.org/abs/2303.16975v5
- Date: Mon, 25 Sep 2023 19:20:58 GMT
- Title: EgoTV: Egocentric Task Verification from Natural Language Task Descriptions
- Authors: Rishi Hazra, Brian Chen, Akshara Rai, Nitin Kamra, Ruta Desai
- Abstract summary: We propose a benchmark and a synthetic dataset called Egocentric Task Verification (EgoTV).
The goal in EgoTV is to verify the execution of tasks from egocentric videos based on the natural language description of these tasks.
We propose a novel Neuro-Symbolic Grounding (NSG) approach that leverages symbolic representations to capture the compositional and temporal structure of tasks.
- Score: 9.503477434050858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To enable progress towards egocentric agents capable of understanding
everyday tasks specified in natural language, we propose a benchmark and a
synthetic dataset called Egocentric Task Verification (EgoTV). The goal in
EgoTV is to verify the execution of tasks from egocentric videos based on the
natural language description of these tasks. EgoTV contains pairs of videos and
their task descriptions for multi-step tasks -- these tasks contain multiple
sub-task decompositions, state changes, object interactions, and sub-task
ordering constraints. In addition, EgoTV also provides abstracted task
descriptions that contain only partial details about ways to accomplish a task.
Consequently, EgoTV requires causal, temporal, and compositional reasoning of
video and language modalities, which is missing in existing datasets. We also
find that existing vision-language models struggle at such all-round reasoning
needed for task verification in EgoTV. Inspired by the needs of EgoTV, we
propose a novel Neuro-Symbolic Grounding (NSG) approach that leverages symbolic
representations to capture the compositional and temporal structure of tasks.
We demonstrate NSG's capability towards task tracking and verification on our
EgoTV dataset and a real-world dataset derived from CrossTask (CTV). We
open-source the EgoTV and CTV datasets and the NSG model for future research on
egocentric assistive agents.
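To make the verification setup concrete, here is a minimal Python sketch of an EgoTV-style interface: a video-language scorer receives an egocentric clip and its natural-language task description and returns a binary verdict. All names (EgoTVSample, score_fn, verify) and the thresholding are illustrative assumptions, not the released dataset or NSG API.

```python
# Hedged sketch of EgoTV-style task verification; names and types are
# illustrative assumptions, not the released EgoTV/NSG interfaces.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class EgoTVSample:
    frames: Sequence[bytes]   # egocentric video frames (placeholder encoding)
    description: str          # natural-language task description
    label: bool               # ground truth: was the described task executed?

def verify(score_fn: Callable[[Sequence[bytes], str], float],
           sample: EgoTVSample,
           threshold: float = 0.5) -> bool:
    """Binary verification: does the video entail the task description?"""
    return score_fn(sample.frames, sample.description) >= threshold

if __name__ == "__main__":
    # Trivial stand-in scorer; a real verifier would be a video-language model
    # or an NSG-style system that grounds symbolic sub-tasks (e.g. heat, slice)
    # and checks their ordering constraints against the video.
    dummy_scorer = lambda frames, desc: 0.9 if "heat" in desc else 0.1
    sample = EgoTVSample(frames=[b""],
                         description="heat the potato, then slice it",
                         label=True)
    print(verify(dummy_scorer, sample))  # -> True
```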
Related papers
- Simultaneous Localization and Affordance Prediction for Tasks in Egocentric Video [18.14234312389889]
We present a system which trains on spatially-localized egocentric videos in order to connect visual input and task descriptions.
We show that our approach outperforms a baseline that uses a VLM to map the similarity of a task's description over a set of location-tagged images.
The resulting system enables robots to use egocentric sensing to navigate to physical locations of novel tasks specified in natural language.
arXiv Detail & Related papers (2024-07-18T18:55:56Z)
- EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation [54.32133648259802]
We present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge.
Building upon the video-language two-tower model and leveraging our meticulously organized egocentric video data, we introduce a novel foundation model called EgoVideo.
This model is specifically designed to cater to the unique characteristics of egocentric videos and provides strong support for our competition submissions.
arXiv Detail & Related papers (2024-06-26T05:01:37Z)
- EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding [27.881857222850083]
EgoExo-Fitness is a new full-body action understanding dataset.
It features fitness sequence videos recorded from synchronized egocentric and fixed exocentric cameras.
EgoExo-Fitness provides new resources to study egocentric and exocentric full-body action understanding.
arXiv Detail & Related papers (2024-06-13T07:28:45Z)
- EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World [44.34800426136217]
We introduce EgoExoLearn, a dataset that emulates the human demonstration-following process.
EgoExoLearn contains egocentric and demonstration video data spanning 120 hours.
We present benchmarks such as cross-view association, cross-view action planning, and cross-view referenced skill assessment.
arXiv Detail & Related papers (2024-03-24T15:00:44Z)
- Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving [85.62076860189116]
Video Task Decathlon (VTD) includes ten representative image and video tasks spanning classification, segmentation, localization, and association of objects and pixels.
We develop our unified network, VTDNet, that uses a single structure and a single set of weights for all ten tasks.
arXiv Detail & Related papers (2023-09-08T16:33:27Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Egocentric Video Task Translation [109.30649877677257]
We propose EgoTask Translation (EgoT2), which takes a collection of models optimized on separate tasks and learns to translate their outputs for improved performance on any or all of them at once.
Unlike traditional transfer or multi-task learning, EgoT2's flipped design entails separate task-specific backbones and a task translator shared across all tasks, which captures synergies between even heterogeneous tasks and mitigates task competition (a sketch of this design follows after this list).
arXiv Detail & Related papers (2022-12-13T00:47:13Z)
- EgoTaskQA: Understanding Human Tasks in Egocentric Videos [89.9573084127155]
The EgoTaskQA benchmark provides a home for crucial dimensions of task understanding through question answering on real-world egocentric videos.
We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others.
We evaluate state-of-the-art video reasoning models on our benchmark and show significant gaps between them and humans in understanding complex goal-oriented egocentric videos.
arXiv Detail & Related papers (2022-10-08T05:49:05Z)
- Egocentric Video-Language Pretraining [74.04740069230692]
Video-Language Pretraining aims to learn transferable representations to advance a wide range of video-text downstream tasks.
We exploit the recently released Ego4D dataset to pioneer egocentric video-language pretraining along three directions.
We demonstrate strong performance on five egocentric downstream tasks across three datasets.
arXiv Detail & Related papers (2022-06-03T16:28:58Z)
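To illustrate the flipped design mentioned above for EgoT2 (separate task-specific backbones plus one shared task translator), here is a minimal PyTorch sketch. The module names, feature sizes, and transformer-based fusion are assumptions for illustration, not the paper's exact architecture.

```python
# Hedged sketch of an EgoT2-style "flipped" design: frozen task-specific
# backbones emit per-task features, and a single shared translator fuses
# them for a target task. Sizes and fusion scheme are assumptions.
import torch
import torch.nn as nn

class TaskTranslator(nn.Module):
    def __init__(self, backbones: nn.ModuleList, dim: int = 256, num_classes: int = 10):
        super().__init__()
        self.backbones = backbones              # one pretrained model per task, kept frozen
        for backbone in self.backbones:
            for param in backbone.parameters():
                param.requires_grad = False
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.translator = nn.TransformerEncoder(layer, num_layers=2)  # shared across tasks
        self.head = nn.Linear(dim, num_classes)  # head for the target task

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # Each backbone reads the same video features and emits one task token.
        tokens = torch.stack([b(video_feats) for b in self.backbones], dim=1)  # (B, T, dim)
        fused = self.translator(tokens)          # cross-task attention captures synergies
        return self.head(fused.mean(dim=1))      # prediction for the target task

# Usage with toy linear "backbones" standing in for pretrained task models:
backbones = nn.ModuleList([nn.Linear(512, 256) for _ in range(3)])
model = TaskTranslator(backbones)
logits = model(torch.randn(2, 512))              # shape: (2, num_classes)
```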