Action Scene Graphs for Long-Form Understanding of Egocentric Videos
- URL: http://arxiv.org/abs/2312.03391v1
- Date: Wed, 6 Dec 2023 10:01:43 GMT
- Title: Action Scene Graphs for Long-Form Understanding of Egocentric Videos
- Authors: Ivan Rodin, Antonino Furnari, Kyle Min, Subarna Tripathi, Giovanni
Maria Farinella
- Abstract summary: We present Egocentric Action Scene Graphs (EASGs), a new representation for long-form understanding of egocentric videos.
EASGs provide a temporally evolving graph-based description of the actions performed by the camera wearer.
We will release the dataset and the code to replicate experiments and annotations.
- Score: 23.058999979457546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Egocentric Action Scene Graphs (EASGs), a new representation for
long-form understanding of egocentric videos. EASGs extend standard
manually-annotated representations of egocentric videos, such as verb-noun
action labels, by providing a temporally evolving graph-based description of
the actions performed by the camera wearer, including interacted objects, their
relationships, and how actions unfold in time. Through a novel annotation
procedure, we extend the Ego4D dataset by adding manually labeled Egocentric
Action Scene Graphs offering a rich set of annotations designed for long-form
egocentric video understanding. We hence define the EASG generation task and
provide a baseline approach, establishing preliminary benchmarks. Experiments
on two downstream tasks, egocentric action anticipation and egocentric activity
summarization, highlight the effectiveness of EASGs for long-form egocentric
video understanding. We will release the dataset and the code to replicate
experiments and annotations.
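The abstract describes EASGs as temporally evolving graphs whose nodes cover the camera wearer, the verb, and the interacted objects, connected by relationship edges. As a rough illustration of that idea, the sketch below shows one hypothetical way such a structure could be held in Python; the class and field names are assumptions made for illustration and do not reflect the released Ego4D-EASG annotation format.

```python
# Hypothetical sketch of an egocentric action scene graph (EASG) structure.
# Field names and layout are illustrative assumptions, not the released schema.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    node_id: str          # e.g. "cw", "verb:take", "obj:knife"
    node_type: str        # "camera_wearer" | "verb" | "object"
    label: str            # human-readable label


@dataclass
class Edge:
    source: str           # node_id of the source node
    target: str           # node_id of the target node
    relation: str         # e.g. "dobj" (direct object), "with", "from"


@dataclass
class ActionGraph:
    """Graph describing a single action performed by the camera wearer."""
    start_time: float     # seconds from the start of the video
    end_time: float
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)


@dataclass
class EASG:
    """Temporally evolving sequence of per-action graphs for one video."""
    video_id: str
    graphs: List[ActionGraph] = field(default_factory=list)

    def actions_between(self, t0: float, t1: float) -> List[ActionGraph]:
        """Return the action graphs overlapping a time window, the kind of
        query long-form tasks such as summarization or anticipation need."""
        return [g for g in self.graphs if g.end_time > t0 and g.start_time < t1]


# Example: "the camera wearer takes the knife from the drawer"
graph = ActionGraph(
    start_time=12.4,
    end_time=14.1,
    nodes=[
        Node("cw", "camera_wearer", "camera wearer"),
        Node("verb:take", "verb", "take"),
        Node("obj:knife", "object", "knife"),
        Node("obj:drawer", "object", "drawer"),
    ],
    edges=[
        Edge("cw", "verb:take", "performs"),
        Edge("verb:take", "obj:knife", "dobj"),
        Edge("verb:take", "obj:drawer", "from"),
    ],
)
easg = EASG(video_id="example_video", graphs=[graph])
print(len(easg.actions_between(10.0, 20.0)))  # -> 1
```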
Related papers
- Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning [80.37314291927889]
We present EMBED, a method designed to transform exocentric video-language data for egocentric video representation learning.
Egocentric videos predominantly feature close-up hand-object interactions, whereas exocentric videos offer a broader perspective on human activities.
By applying both vision and language style transfer, our framework creates a new egocentric dataset.
arXiv Detail & Related papers (2024-08-07T06:10:45Z)
- EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding [27.881857222850083]
EgoExo-Fitness is a new full-body action understanding dataset.
It features fitness sequence videos recorded from synchronized egocentric and fixed exocentric cameras.
EgoExo-Fitness provides new resources to study egocentric and exocentric full-body action understanding.
arXiv Detail & Related papers (2024-06-13T07:28:45Z)
- Object Aware Egocentric Online Action Detection [23.504280692701272]
We introduce an Object-Aware Module that integrates egocentric-specific priors into existing Online Action Detection frameworks.
The module can be seamlessly integrated into existing models with minimal overhead and brings consistent performance improvements.
arXiv Detail & Related papers (2024-06-03T07:58:40Z)
- EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions? [48.702973928321946]
We introduce EgoNCE++, a novel asymmetric contrastive objective for egocentric hand-object interaction (EgoHOI).
Our experiments demonstrate that EgoNCE++ significantly boosts open-vocabulary HOI recognition, multi-instance retrieval, and action recognition tasks.
arXiv Detail & Related papers (2024-05-28T00:27:29Z)
- Retrieval-Augmented Egocentric Video Captioning [53.2951243928289]
EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the captioning of egocentric videos.
We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions (see the sketch after this list).
arXiv Detail & Related papers (2024-01-01T15:31:06Z)
- LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning [27.661804052577825]
We introduce a novel problem -- egocentric action frame generation.
The goal is to synthesize an image depicting an action in the user's context (i.e., action frame) by conditioning on a user prompt and an input egocentric image.
arXiv Detail & Related papers (2023-12-06T19:02:40Z)
- Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
arXiv Detail & Related papers (2023-06-08T19:54:08Z)
- Egocentric Video-Language Pretraining [74.04740069230692]
Video-Language Pretraining aims to learn transferable representations to advance a wide range of video-text downstream tasks.
We exploit the recently released Ego4D dataset to pioneer egocentric video-language pretraining along three directions.
We demonstrate strong performance on five egocentric downstream tasks across three datasets.
arXiv Detail & Related papers (2022-06-03T16:28:58Z)
- Learning to Recognize Actions on Objects in Egocentric Video with Attention Dictionaries [51.48859591280838]
We present EgoACO, a deep neural architecture for video action recognition.
It learns to pool action-context-object descriptors from frame-level features.
CAP uses self-attention with a dictionary of learnable weights to pool from the most relevant feature regions.
arXiv Detail & Related papers (2021-02-16T10:26:04Z)
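The Retrieval-Augmented Egocentric Video Captioning entry above mentions an EgoExoNCE loss that aligns egocentric and exocentric video features to shared text features describing the same actions. As a rough illustration of that general idea (not the authors' implementation), the sketch below applies a symmetric InfoNCE-style objective where both views are matched against the text embeddings of their paired captions; the batch construction, temperature value, and function names are assumptions.

```python
# Hypothetical sketch of a cross-view contrastive objective in the spirit of
# the EgoExoNCE loss described above. This is NOT the authors' implementation;
# the batch pairing and temperature are simplifying assumptions.
import torch
import torch.nn.functional as F


def cross_view_nce(ego_feat, exo_feat, text_feat, temperature=0.07):
    """ego_feat, exo_feat, text_feat: (B, D) tensors; row i of each tensor
    corresponds to the same action description."""
    ego = F.normalize(ego_feat, dim=-1)
    exo = F.normalize(exo_feat, dim=-1)
    txt = F.normalize(text_feat, dim=-1)

    # Each clip's positive is the text embedding at the same batch index;
    # the other rows in the batch act as negatives (InfoNCE over the batch).
    targets = torch.arange(ego.size(0), device=ego.device)
    loss_ego = F.cross_entropy(ego @ txt.t() / temperature, targets)
    loss_exo = F.cross_entropy(exo @ txt.t() / temperature, targets)
    return 0.5 * (loss_ego + loss_exo)


# Usage with random features standing in for encoder outputs.
B, D = 8, 256
ego = torch.randn(B, D, requires_grad=True)
exo = torch.randn(B, D, requires_grad=True)
txt = torch.randn(B, D)
loss = cross_view_nce(ego, exo, txt)
loss.backward()
print(float(loss))
```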