Ego-Only: Egocentric Action Detection without Exocentric Transferring
- URL: http://arxiv.org/abs/2301.01380v2
- Date: Fri, 19 May 2023 22:23:48 GMT
- Authors: Huiyu Wang, Mitesh Kumar Singh, Lorenzo Torresani
- Abstract summary: We present Ego-Only, the first approach that enables state-of-the-art action detection on egocentric (first-person) videos.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Ego-Only, the first approach that enables state-of-the-art action
detection on egocentric (first-person) videos without any form of exocentric
(third-person) transferring. Despite the content and appearance gap separating
the two domains, large-scale exocentric transferring has been the default
choice for egocentric action detection. This is because prior works found that
egocentric models are difficult to train from scratch and that transferring
from exocentric representations leads to improved accuracy. However, in this
paper, we revisit this common belief. Motivated by the large gap separating the
two domains, we propose a strategy that enables effective training of
egocentric models without exocentric transferring. Our Ego-Only approach is
simple. It trains the video representation with a masked autoencoder finetuned
for temporal segmentation. The learned features are then fed to an
off-the-shelf temporal action localization method to detect actions. This
simple Ego-Only approach achieves remarkably strong results on three
established egocentric video datasets (Ego4D, EPIC-Kitchens-100, and
Charades-Ego), rendering exocentric transferring unnecessary. On both
action detection and action recognition, Ego-Only outperforms previous best
exocentric transferring methods that use orders of magnitude more labels.
Ego-Only sets new state-of-the-art results on these datasets and benchmarks
without exocentric data.
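The two-stage recipe described in the abstract (a video representation trained with a masked autoencoder, then an off-the-shelf temporal action localizer applied to the learned features) can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the clip encoder below is a toy stand-in for an MAE-finetuned video backbone, and the score-thresholding localizer is a toy stand-in for an off-the-shelf temporal action localization method.

```python
import numpy as np

def encode_clips(video_frames, clip_len=16):
    """Toy stand-in for stage 1: one feature vector per clip.

    In Ego-Only this would be an MAE-pretrained backbone finetuned for
    temporal segmentation; here we simply average frames within each clip.
    """
    n_clips = len(video_frames) // clip_len
    return np.stack([
        video_frames[i * clip_len:(i + 1) * clip_len].mean(axis=0)
        for i in range(n_clips)
    ])

def localize_actions(clip_scores, threshold=0.5):
    """Toy stand-in for stage 2: merge consecutive clips whose action
    score exceeds a threshold into (start, end) clip-index segments."""
    segments, start = [], None
    for t, score in enumerate(clip_scores):
        if score >= threshold and start is None:
            start = t                       # segment opens
        elif score < threshold and start is not None:
            segments.append((start, t))     # segment closes
            start = None
    if start is not None:
        segments.append((start, len(clip_scores)))
    return segments

# Example: 128 toy frames -> 8 clip features, then localize from toy scores.
frames = np.random.rand(128, 8)             # 128 frames, 8-dim each
features = encode_clips(frames)             # shape: (8, 8)
scores = [0.1, 0.7, 0.9, 0.2, 0.6, 0.8, 0.9, 0.1]
print(localize_actions(scores))             # -> [(1, 3), (4, 7)]
```

The key design point the paper argues is that stage 1 can be trained entirely on egocentric video, with no exocentric pretraining, and still feed stage 2 well enough to reach state-of-the-art detection.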
Related papers
- EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding (2024-06-13)
  EgoExo-Fitness is a new full-body action understanding dataset. It features fitness sequence videos recorded from synchronized egocentric and fixed exocentric cameras, providing new resources to study egocentric and exocentric full-body action understanding.
- Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos (2024-03-11)
  Exocentric-to-egocentric cross-view translation aims to generate a first-person (egocentric) view of an actor from a video that captures the actor from a third-person (exocentric) perspective. The paper proposes a generative framework, Exo2Ego, that decouples the translation into two stages: high-level structure transformation and pixel-level hallucination.
- Retrieval-Augmented Egocentric Video Captioning (2024-01-01)
  EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos. Its cross-view retrieval module is trained with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features describing similar actions.
- EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding (2023-09-05)
  This paper proposes a new framework, EgoPCA (Probing, Curation and Adaption), as an infrastructure to advance Ego-HOI recognition. It contributes comprehensive pre-training sets, balanced test sets, and a new baseline, complete with a training-finetuning strategy, to pave a new way for Ego-HOI understanding.
- Ego-Body Pose Estimation via Ego-Head Pose Estimation (2022-12-09)
  Estimating 3D human motion from an egocentric video sequence plays a critical role in human behavior understanding and has various applications in VR/AR. The proposed EgoEgo method decomposes the problem into two stages connected by head motion as an intermediate representation; this disentanglement of head and body pose eliminates the need for training datasets with paired egocentric videos and 3D human motion.
- Egocentric Video-Language Pretraining (2022-06-03)
  Video-language pretraining aims to learn transferable representations that advance a wide range of video-text downstream tasks. The paper exploits the recently released Ego4D dataset to pioneer egocentric pretraining along three directions, demonstrating strong performance on five egocentric downstream tasks across three datasets.
- Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos (2021-04-16)
  This work introduces an approach for pre-training egocentric video models using large-scale third-person video datasets. The idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties; experiments show the Ego-Exo framework can be seamlessly integrated into standard video models.
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.