Ego-Only: Egocentric Action Detection without Exocentric Transferring
- URL: http://arxiv.org/abs/2301.01380v2
- Date: Fri, 19 May 2023 22:23:48 GMT
- Title: Ego-Only: Egocentric Action Detection without Exocentric Transferring
- Authors: Huiyu Wang, Mitesh Kumar Singh, Lorenzo Torresani
- Abstract summary: We present Ego-Only, the first approach that enables state-of-the-art action detection on egocentric (first-person) videos.
- Score: 37.89647493482049
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Ego-Only, the first approach that enables state-of-the-art action detection on egocentric (first-person) videos without any form of exocentric (third-person) transferring. Despite the content and appearance gap separating the two domains, large-scale exocentric transferring has been the default choice for egocentric action detection, because prior works found that egocentric models are difficult to train from scratch and that transferring from exocentric representations leads to improved accuracy. In this paper, we revisit this common belief. Motivated by the large gap separating the two domains, we propose a strategy that enables effective training of egocentric models without exocentric transferring. Our Ego-Only approach is simple: it trains the video representation with a masked autoencoder finetuned for temporal segmentation, then feeds the learned features to an off-the-shelf temporal action localization method to detect actions. This simple approach achieves remarkably strong results on three established egocentric video datasets (Ego4D, EPIC-Kitchens-100, and Charades-Ego), showing that exocentric transferring is unnecessary. On both action detection and action recognition, Ego-Only outperforms the previous best exocentric transferring methods, which use orders of magnitude more labels, and sets new state-of-the-art results on these datasets and benchmarks without any exocentric data.
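The abstract outlines a three-stage recipe: masked-autoencoder pretraining on egocentric video only, finetuning the same encoder for temporal segmentation, and feeding the resulting features to an off-the-shelf temporal action localization (TAL) method. The sketch below illustrates that flow in PyTorch under simplifying assumptions: the toy token encoder, the in-place masking (rather than the asymmetric MAE encoder/decoder), and the linear heads are placeholders for illustration, not the authors' implementation.

```python
# Hedged sketch of the Ego-Only recipe from the abstract: (1) masked-autoencoder
# pretraining on egocentric clips, (2) finetuning the encoder with a per-token
# temporal-segmentation head, (3) handing frozen features to an off-the-shelf
# TAL detector. All module and loss choices below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenEncoder(nn.Module):
    """Toy stand-in for a video transformer operating on pre-tokenized clips."""
    def __init__(self, dim=256, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens):                   # tokens: (B, N, dim)
        return self.blocks(tokens)

def mae_step(encoder, decoder, tokens, mask_ratio=0.9):
    """Stage 1: masked autoencoding -- hide most tokens, reconstruct them."""
    B, N, D = tokens.shape
    mask = torch.rand(B, N, device=tokens.device) < mask_ratio   # True = hidden
    corrupted = tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    pred = decoder(encoder(corrupted))                            # (B, N, D)
    return F.mse_loss(pred[mask], tokens[mask])                   # loss on masked tokens only

def segmentation_finetune_step(encoder, seg_head, tokens, frame_labels):
    """Stage 2: finetune with a per-token temporal segmentation head."""
    feats = encoder(tokens)                                       # (B, N, D)
    logits = seg_head(feats)                                      # (B, N, num_classes)
    return F.cross_entropy(logits.flatten(0, 1), frame_labels.flatten())

@torch.no_grad()
def extract_features(encoder, tokens):
    """Stage 3 input: frozen clip features for an off-the-shelf TAL detector."""
    encoder.eval()
    return encoder(tokens)

if __name__ == "__main__":
    dim, classes = 256, 11
    enc = TokenEncoder(dim)
    dec = nn.Linear(dim, dim)                     # toy MAE decoder
    seg = nn.Linear(dim, classes)                 # toy segmentation head
    tokens = torch.randn(2, 32, dim)              # 2 clips x 32 tokens
    labels = torch.randint(0, classes, (2, 32))
    print(mae_step(enc, dec, tokens).item())
    print(segmentation_finetune_step(enc, seg, tokens, labels).item())
    print(extract_features(enc, tokens).shape)
```

The only point of the sketch is the ordering of the three stages and the fact that no exocentric data or labels enter at any step; a real system would use a video transformer backbone and a dedicated TAL detector on the frozen features.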
Related papers
- Exocentric To Egocentric Transfer For Action Recognition: A Short Survey [25.41820386246096]
Egocentric vision captures the scene from the point of view of the camera wearer.
Exocentric vision captures the overall scene context.
Jointly modeling ego and exo views is crucial to developing next-generation AI agents.
arXiv Detail & Related papers (2024-10-27T22:38:51Z)
- Ego3DT: Tracking Every 3D Object in Ego-centric Videos [20.96550148331019]
This paper introduces a novel zero-shot approach for the 3D reconstruction and tracking of all objects in ego-centric videos.
We present Ego3DT, a novel framework that initially identifies and extracts detection and segmentation information of objects within the ego environment.
We also introduce a dynamic hierarchical association mechanism that creates stable 3D tracking trajectories for objects in ego-centric videos.
arXiv Detail & Related papers (2024-10-11T05:02:31Z)
- Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning [80.37314291927889]
We present EMBED, a method designed to transform exocentric video-language data for egocentric video representation learning.
Egocentric videos predominantly feature close-up hand-object interactions, whereas exocentric videos offer a broader perspective on human activities.
By applying both vision and language style transfer, our framework creates a new egocentric dataset.
arXiv Detail & Related papers (2024-08-07T06:10:45Z)
- EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding [27.881857222850083]
EgoExo-Fitness is a new full-body action understanding dataset.
It features fitness sequence videos recorded from synchronized egocentric and fixed exocentric cameras.
EgoExo-Fitness provides new resources to study egocentric and exocentric full-body action understanding.
arXiv Detail & Related papers (2024-06-13T07:28:45Z)
- Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos [66.46812056962567]
Exocentric-to-egocentric cross-view translation aims to generate a first-person (egocentric) view of an actor based on a video recording that captures the actor from a third-person (exocentric) perspective.
We propose a generative framework called Exo2Ego that decouples the translation process into two stages: high-level structure transformation and a pixel-level hallucination.
arXiv Detail & Related papers (2024-03-11T01:00:00Z)
- Retrieval-Augmented Egocentric Video Captioning [53.2951243928289]
EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos.
We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions (a generic sketch of this idea appears after the list below).
arXiv Detail & Related papers (2024-01-01T15:31:06Z)
- EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding [99.904140768186]
This paper proposes a new framework as an infrastructure to advance Ego-HOI recognition by Probing, Curation and Adaption (EgoPCA).
We contribute comprehensive pre-train sets, balanced test sets and a new baseline, which are complete with a training-finetuning strategy.
We believe our data and the findings will pave a new way for Ego-HOI understanding.
arXiv Detail & Related papers (2023-09-05T17:51:16Z)
- Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos [92.38049744463149]
We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets.
Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties.
Our experiments show that our Ego-Exo framework can be seamlessly integrated into standard video models.
arXiv Detail & Related papers (2021-04-16T06:10:10Z)
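The EgoExoNCE loss mentioned in the Retrieval-Augmented Egocentric Video Captioning entry above is only described at a high level; the following is a generic InfoNCE-style sketch of that idea, aligning ego and exo video features to shared text features. It is an assumption for illustration, not the paper's exact formulation (which, per the summary, forms positives from text describing similar actions).

```python
# Hedged sketch: an InfoNCE-style objective that pulls egocentric and exocentric
# video features toward shared text features describing the same action. The
# exact EgoExoNCE formulation (e.g. how positives are grouped) may differ.
import torch
import torch.nn.functional as F

def info_nce(video_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss between L2-normalized video and text features."""
    v = F.normalize(video_feats, dim=-1)          # (B, D)
    t = F.normalize(text_feats, dim=-1)           # (B, D)
    logits = v @ t.T / temperature                # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

def ego_exo_text_alignment_loss(ego_feats, exo_feats, text_feats):
    """Align both views to the shared text anchor."""
    return info_nce(ego_feats, text_feats) + info_nce(exo_feats, text_feats)

if __name__ == "__main__":
    B, D = 8, 512
    ego, exo, txt = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
    print(ego_exo_text_alignment_loss(ego, exo, txt).item())
```

Because both views are pulled toward the same text anchor, ego and exo features of the same action also end up close to each other, without requiring paired ego and exo recordings.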
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.