Put Myself in Your Shoes: Lifting the Egocentric Perspective from
Exocentric Videos
- URL: http://arxiv.org/abs/2403.06351v1
- Date: Mon, 11 Mar 2024 01:00:00 GMT
- Title: Put Myself in Your Shoes: Lifting the Egocentric Perspective from
Exocentric Videos
- Authors: Mi Luo, Zihui Xue, Alex Dimakis, Kristen Grauman
- Abstract summary: Exocentric-to-egocentric cross-view translation aims to generate a first-person (egocentric) view of an actor based on a video recording that captures the actor from a third-person (exocentric) perspective.
We propose a generative framework called Exo2Ego that decouples the translation process into two stages: high-level structure transformation and diffusion-based pixel-level hallucination.
- Score: 66.46812056962567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate exocentric-to-egocentric cross-view translation, which aims to
generate a first-person (egocentric) view of an actor based on a video
recording that captures the actor from a third-person (exocentric) perspective.
To this end, we propose a generative framework called Exo2Ego that decouples
the translation process into two stages: high-level structure transformation,
which explicitly encourages cross-view correspondence between exocentric and
egocentric views, and a diffusion-based pixel-level hallucination, which
incorporates a hand layout prior to enhance the fidelity of the generated
egocentric view. To pave the way for future advancements in this field, we
curate a comprehensive exo-to-ego cross-view translation benchmark. It consists
of a diverse collection of synchronized ego-exo tabletop activity video pairs
sourced from three public datasets: H2O, Aria Pilot, and Assembly101. The
experimental results validate that Exo2Ego delivers photorealistic video
results with clear hand manipulation details and outperforms several baselines
in terms of both synthesis quality and generalization ability to new actions.
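To make the two-stage design described in the abstract concrete, here is a minimal, hypothetical sketch in PyTorch. The module names, layer choices, and the toy sampling loop are placeholders invented for illustration, not the paper's actual architecture; only the overall split into a structure-transformation stage and a layout-conditioned pixel-hallucination stage follows the abstract.

```python
# Minimal sketch of a two-stage exo-to-ego pipeline in the spirit of Exo2Ego.
# All modules and functions are hypothetical placeholders; the paper's actual
# architecture is not specified in this abstract.
import torch
import torch.nn as nn


class StructureTransformer(nn.Module):
    """Stage 1 (hypothetical): map an exocentric frame to a coarse egocentric
    layout map (e.g., a hand layout) that encodes cross-view structure."""

    def __init__(self, in_ch: int = 3, layout_ch: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, layout_ch, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, exo_frame: torch.Tensor) -> torch.Tensor:
        return self.net(exo_frame)


class PixelHallucinator(nn.Module):
    """Stage 2 (hypothetical): a simple denoiser standing in for the diffusion
    model that hallucinates egocentric pixels conditioned on the exocentric
    frame and the predicted hand-layout prior."""

    def __init__(self, in_ch: int = 3 + 3 + 1, out_ch: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_ch, 3, padding=1),
        )

    def forward(self, noisy_ego, exo_frame, layout):
        cond = torch.cat([noisy_ego, exo_frame, layout], dim=1)
        return self.net(cond)  # predicted egocentric frame


def exo_to_ego(exo_frame, structure, hallucinator, steps: int = 4):
    """Toy sampling loop: refine a noise image into an egocentric frame.
    A real diffusion sampler would follow a proper noise schedule."""
    layout = structure(exo_frame)
    ego = torch.randn_like(exo_frame)
    for _ in range(steps):
        ego = hallucinator(ego, exo_frame, layout)
    return ego


if __name__ == "__main__":
    exo = torch.randn(1, 3, 128, 128)  # dummy exocentric frame
    ego = exo_to_ego(exo, StructureTransformer(), PixelHallucinator())
    print(ego.shape)  # torch.Size([1, 3, 128, 128])
```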
Related papers
- Exocentric To Egocentric Transfer For Action Recognition: A Short Survey [25.41820386246096]
Egocentric vision captures the scene from the point of view of the camera wearer.
Exocentric vision captures the overall scene context.
Jointly modeling ego and exo views is crucial to developing next-generation AI agents.
arXiv Detail & Related papers (2024-10-27T22:38:51Z) - Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning [80.37314291927889]
We present EMBED, a method designed to transform exocentric video-language data for egocentric video representation learning.
Egocentric videos predominantly feature close-up hand-object interactions, whereas exocentric videos offer a broader perspective on human activities.
By applying both vision and language style transfer, our framework creates a new egocentric dataset.
arXiv Detail & Related papers (2024-08-07T06:10:45Z) - Intention-driven Ego-to-Exo Video Generation [16.942040396018736]
Ego-to-exo video generation refers to generating the corresponding exocentric video according to the egocentric video.
This paper proposes an Intention-Driven Ego-to-exo video generation framework (IDE) that leverages action description as a view-independent representation.
We conduct experiments on the relevant dataset with diverse exo-ego video pairs, and our IDE outperforms state-of-the-art models in both subjective and objective assessments.
arXiv Detail & Related papers (2024-03-14T09:07:31Z) - POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object
Interaction in the Multi-View World [59.545114016224254]
Humans are good at translating third-person observations of hand-object interactions into an egocentric view.
We propose a Prompt-Oriented View-agnostic learning framework, which enables this view adaptation with few egocentric videos.
arXiv Detail & Related papers (2024-03-09T09:54:44Z) - Retrieval-Augmented Egocentric Video Captioning [53.2951243928289]
EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos.
We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-01-01T15:31:06Z) - Learning Fine-grained View-Invariant Representations from Unpaired
Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
arXiv Detail & Related papers (2023-06-08T19:54:08Z) - Enhancing Egocentric 3D Pose Estimation with Third Person Views [37.9683439632693]
We propose a novel approach to enhance the 3D body pose estimation of a person computed from videos captured from a single wearable camera.
We introduce First2Third-Pose, a new paired synchronized dataset of nearly 2,000 videos depicting human activities captured from both first- and third-view perspectives.
Experimental results demonstrate that the joint multi-view embedded space learned with our dataset is useful to extract discriminatory features from arbitrary single-view egocentric videos.
arXiv Detail & Related papers (2022-01-06T11:42:01Z) - Cross-View Exocentric to Egocentric Video Synthesis [18.575642755375107]
The cross-view video synthesis task seeks to generate video sequences of one view from another, dramatically different view.
We propose a novel Bi-directional Spatial Temporal Attention Fusion Generative Adversarial Network (STA-GAN) to learn both spatial and temporal information.
The proposed STA-GAN consists of three parts: temporal branch, spatial branch, and attention fusion.
arXiv Detail & Related papers (2021-07-07T10:00:52Z) - Ego-Exo: Transferring Visual Representations from Third-person to
First-person Videos [92.38049744463149]
We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets.
Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties.
Our experiments show that our Ego-Exo framework can be seamlessly integrated into standard video models.
arXiv Detail & Related papers (2021-04-16T06:10:10Z)