Action Scene Graphs for Long-Form Understanding of Egocentric Videos
- URL: http://arxiv.org/abs/2312.03391v1
- Date: Wed, 6 Dec 2023 10:01:43 GMT
- Title: Action Scene Graphs for Long-Form Understanding of Egocentric Videos
- Authors: Ivan Rodin, Antonino Furnari, Kyle Min, Subarna Tripathi, Giovanni
Maria Farinella
- Abstract summary: We present Egocentric Action Scene Graphs (EASGs), a new representation for long-form understanding of egocentric videos.
EASGs provide a temporally evolving graph-based description of the actions performed by the camera wearer.
We will release the dataset and the code to replicate experiments and annotations.
- Score: 23.058999979457546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Egocentric Action Scene Graphs (EASGs), a new representation for
long-form understanding of egocentric videos. EASGs extend standard
manually-annotated representations of egocentric videos, such as verb-noun
action labels, by providing a temporally evolving graph-based description of
the actions performed by the camera wearer, including interacted objects, their
relationships, and how actions unfold in time. Through a novel annotation
procedure, we extend the Ego4D dataset by adding manually labeled Egocentric
Action Scene Graphs, offering a rich set of annotations designed for long-form
egocentric video understanding. We hence define the EASG generation task and
provide a baseline approach, establishing preliminary benchmarks. Experiments
on two downstream tasks, egocentric action anticipation and egocentric activity
summarization, highlight the effectiveness of EASGs for long-form egocentric
video understanding. We will release the dataset and the code to replicate
experiments and annotations.
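To make the representation concrete, here is a minimal sketch of how a temporally evolving action scene graph could be modeled in code: each action of the camera wearer is a small graph with a verb, object nodes, and labeled relation edges, and the graphs are ordered in time. The class and field names are illustrative assumptions, not the schema of the released EASG annotations.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch only: names and fields are assumptions,
# not the schema of the released EASG annotations.

@dataclass
class ObjectNode:
    name: str        # interacted object, e.g. "dough"
    relation: str    # edge label toward the action, e.g. "dobj" or "with"

@dataclass
class ActionGraph:
    verb: str                          # action of the camera wearer, e.g. "cut"
    objects: List[ObjectNode] = field(default_factory=list)
    start_time: float = 0.0            # seconds into the video
    end_time: float = 0.0

@dataclass
class EgocentricActionSceneGraph:
    """A temporally ordered sequence of per-action graphs for one video."""
    video_id: str
    actions: List[ActionGraph] = field(default_factory=list)

    def ordered(self) -> List[ActionGraph]:
        # Temporal evolution: actions sorted by when they start.
        return sorted(self.actions, key=lambda a: a.start_time)

# Hypothetical usage
easg = EgocentricActionSceneGraph(
    video_id="example_video",
    actions=[
        ActionGraph(
            verb="cut",
            objects=[ObjectNode("dough", "dobj"), ObjectNode("knife", "with")],
            start_time=12.0,
            end_time=15.5,
        ),
    ],
)
```

A structure along these lines mirrors the verb-noun action labels of Ego4D while adding the object relations and temporal ordering that the abstract describes.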
Related papers
- EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation [30.350824860817536]
EgoVid-5M is the first high-quality dataset curated for egocentric video generation.
We introduce EgoDreamer, which is capable of generating egocentric videos driven simultaneously by action descriptions and kinematic control signals.
arXiv Detail & Related papers (2024-11-13T07:05:40Z)
- Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning [80.37314291927889]
We present EMBED, a method designed to transform exocentric video-language data for egocentric video representation learning.
Egocentric videos predominantly feature close-up hand-object interactions, whereas exocentric videos offer a broader perspective on human activities.
By applying both vision and language style transfer, our framework creates a new egocentric dataset.
arXiv Detail & Related papers (2024-08-07T06:10:45Z)
- EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding [27.881857222850083]
EgoExo-Fitness is a new full-body action understanding dataset.
It features fitness sequence videos recorded from synchronized egocentric and fixed exocentric cameras.
EgoExo-Fitness provides new resources to study egocentric and exocentric full-body action understanding.
arXiv Detail & Related papers (2024-06-13T07:28:45Z)
- Object Aware Egocentric Online Action Detection [23.504280692701272]
We introduce an Object-Aware Module that integrates egocentric-specific priors into existing Online Action Detection frameworks.
Our module can be seamlessly integrated into existing models with minimal overhead and brings consistent performance improvements.
arXiv Detail & Related papers (2024-06-03T07:58:40Z)
- EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions? [48.702973928321946]
We introduce a novel asymmetric contrastive objective for EgoHOI named EgoNCE++.
Our experiments demonstrate that EgoNCE++ significantly boosts open-vocabulary HOI recognition, multi-instance retrieval, and action recognition tasks.
arXiv Detail & Related papers (2024-05-28T00:27:29Z)
- Retrieval-Augmented Egocentric Video Captioning [53.2951243928289]
EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos.
We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions; a rough sketch of this kind of cross-view objective appears after this list.
arXiv Detail & Related papers (2024-01-01T15:31:06Z)
- LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning [27.661804052577825]
We introduce a novel problem -- egocentric action frame generation.
The goal is to synthesize an image depicting an action in the user's context (i.e., action frame) by conditioning on a user prompt and an input egocentric image.
arXiv Detail & Related papers (2023-12-06T19:02:40Z)
- Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
arXiv Detail & Related papers (2023-06-08T19:54:08Z)
- Egocentric Video-Language Pretraining [74.04740069230692]
Video-Language Pretraining aims to learn transferable representations that advance a wide range of video-text downstream tasks.
We exploit the recently released Ego4D dataset to pioneer egocentric video-language pretraining along three directions.
We demonstrate strong performance on five egocentric downstream tasks across three datasets.
arXiv Detail & Related papers (2022-06-03T16:28:58Z)
- Learning to Recognize Actions on Objects in Egocentric Video with Attention Dictionaries [51.48859591280838]
We present EgoACO, a deep neural architecture for video action recognition.
It learns to pool action-context-object descriptors from frame-level features.
Cap uses self-attention with a dictionary of learnable weights to pool from the most relevant feature regions.
arXiv Detail & Related papers (2021-02-16T10:26:04Z)
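Several of the entries above describe cross-view contrastive objectives (for example, the EgoExoNCE loss in Retrieval-Augmented Egocentric Video Captioning, which aligns egocentric and exocentric video features to shared text features). The following is a generic InfoNCE-style sketch of that idea under assumed feature shapes; it is an illustration, not the actual loss of any paper listed here.

```python
import torch
import torch.nn.functional as F

def cross_view_text_alignment_loss(ego_feats, exo_feats, text_feats, temperature=0.07):
    """Illustrative InfoNCE-style objective: egocentric and exocentric video
    features are both pulled toward the text feature of the same action
    (matched by row index). Not the exact formulation of any paper above."""
    ego = F.normalize(ego_feats, dim=-1)   # (B, D)
    exo = F.normalize(exo_feats, dim=-1)   # (B, D)
    txt = F.normalize(text_feats, dim=-1)  # (B, D)

    targets = torch.arange(ego.size(0), device=ego.device)

    # Video-to-text similarities; diagonal entries are the matching pairs.
    ego_logits = ego @ txt.t() / temperature
    exo_logits = exo @ txt.t() / temperature

    return (F.cross_entropy(ego_logits, targets) +
            F.cross_entropy(exo_logits, targets)) / 2

# Hypothetical usage with random features
B, D = 8, 256
loss = cross_view_text_alignment_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
```

Because both views are anchored to the same text features, egocentric and exocentric representations of the same action are drawn together indirectly, which is the intuition behind the cross-view retrieval module referenced above.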