Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations
- URL: http://arxiv.org/abs/2503.19706v2
- Date: Mon, 31 Mar 2025 08:46:51 GMT
- Title: Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations
- Authors: Jungin Park, Jiyoung Lee, Kwanghoon Sohn,
- Abstract summary: We propose a novel masked ego-exo modeling that promotes both causal temporal dynamics and cross-view alignment. We highlight the importance of capturing the compositional nature of human actions as a basis for robust cross-view understanding.
- Score: 47.04855334955006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: View-invariant representation learning from egocentric (first-person, ego) and exocentric (third-person, exo) videos is a promising approach toward generalizing video understanding systems across multiple viewpoints. However, this area has been underexplored due to the substantial differences in perspective, motion patterns, and context between ego and exo views. In this paper, we propose a novel masked ego-exo modeling that promotes both causal temporal dynamics and cross-view alignment, called Bootstrap Your Own Views (BYOV), for fine-grained view-invariant video representation learning from unpaired ego-exo videos. We highlight the importance of capturing the compositional nature of human actions as a basis for robust cross-view understanding. Specifically, self-view masking and cross-view masking predictions are designed to learn view-invariant and powerful representations concurrently. Experimental results demonstrate that our BYOV significantly surpasses existing approaches with notable gains across all metrics in four downstream ego-exo video tasks. The code is available at https://github.com/park-jungin/byov.
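To make the self-view / cross-view masked prediction idea concrete, the following is a minimal, illustrative PyTorch sketch. The module names, feature shapes, mask ratio, and loss forms are assumptions for exposition only and do not reproduce the released BYOV implementation (see the repository linked above for the actual code).

```python
# Illustrative sketch of self-view and cross-view masked prediction on unpaired
# ego/exo clip features. All names, shapes, and the mask ratio are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def random_mask(tokens, ratio=0.5):
    """Zero out a random subset of frame tokens and return the boolean mask."""
    b, t, d = tokens.shape
    mask = torch.rand(b, t, device=tokens.device) < ratio        # True = masked
    return tokens.masked_fill(mask.unsqueeze(-1), 0.0), mask

class MaskedEgoExoSketch(nn.Module):
    def __init__(self, dim=256, heads=4, depth=2):
        super().__init__()
        make_enc = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        self.ego_enc, self.exo_enc = make_enc(), make_enc()      # per-view encoders
        self.self_head = nn.Linear(dim, dim)                     # self-view prediction
        self.cross_head = nn.Linear(dim, dim)                    # cross-view prediction

    def forward(self, ego_feats, exo_feats):
        # Self-view masking: reconstruct a view's own masked frame tokens.
        ego_in, ego_mask = random_mask(ego_feats)
        ego_out = self.ego_enc(ego_in)
        self_loss = F.mse_loss(self.self_head(ego_out)[ego_mask],
                               ego_feats[ego_mask])

        # Cross-view masking/prediction: predict a clip-level target of the other
        # view, encouraging view-invariant features even from unpaired videos.
        exo_out = self.exo_enc(exo_feats)
        cross_loss = F.mse_loss(self.cross_head(ego_out.mean(1)),
                                exo_out.mean(1).detach())
        return self_loss + cross_loss
```

In this sketch the cross-view target is detached, bootstrap-style; the actual BYOV objectives may differ in how targets and masks are constructed.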
Related papers
- EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos [49.24266108952835]
Given an exo-centric video, the first frame of the corresponding ego-centric video, and textual instructions, the goal is to generate future frames of the ego-centric video.
EgoExo-Gen explicitly models the hand-object dynamics for cross-view video prediction.
arXiv Detail & Related papers (2025-04-16T03:12:39Z) - Intention-driven Ego-to-Exo Video Generation [16.942040396018736]
Ego-to-exo video generation refers to generating the corresponding exocentric video from a given egocentric video.
This paper proposes an Intention-Driven Ego-to-Exo video generation framework (IDE) that leverages action descriptions as a view-independent representation.
We conduct experiments on a relevant dataset with diverse exo-ego video pairs, and our IDE outperforms state-of-the-art models in both subjective and objective assessments.
arXiv Detail & Related papers (2024-03-14T09:07:31Z) - Put Myself in Your Shoes: Lifting the Egocentric Perspective from
Exocentric Videos [66.46812056962567]
Exocentric-to-egocentric cross-view translation aims to generate a first-person (egocentric) view of an actor based on a video recording that captures the actor from a third-person (exocentric) perspective.
We propose a generative framework called Exo2Ego that decouples the translation process into two stages: high-level structure transformation and pixel-level hallucination.
arXiv Detail & Related papers (2024-03-11T01:00:00Z) - POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object
Interaction in the Multi-View World [59.545114016224254]
Humans are good at translating third-person observations of hand-object interactions into an egocentric view.
We propose a Prompt-Oriented View-agnostic learning framework, which enables this view adaptation with few egocentric videos.
arXiv Detail & Related papers (2024-03-09T09:54:44Z) - Retrieval-Augmented Egocentric Video Captioning [53.2951243928289]
EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos.
We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions.
arXiv Detail & Related papers (2024-01-01T15:31:06Z) - Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos [25.910110689486952]
We propose a novel benchmark for cross-view knowledge transfer of dense video captioning. We adapt models from web instructional videos with exocentric views to an egocentric view. Our experiments confirm the effectiveness of overcoming the view change problem and knowledge transfer to egocentric views.
arXiv Detail & Related papers (2023-11-28T02:51:13Z) - Learning Fine-grained View-Invariant Representations from Unpaired
Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
arXiv Detail & Related papers (2023-06-08T19:54:08Z)
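To illustrate the temporal-alignment idea behind the AE2 approach described above, here is a minimal sketch that aligns ego and exo frame embeddings with soft nearest-neighbor cycle-back regression (in the spirit of temporal cycle-consistency learning). The function name, temperature, and loss form are assumptions used purely for exposition; this is not the AE2 code.

```python
# Illustrative sketch: align two frame-embedding sequences of the same action in
# time via soft nearest neighbors and cycle-back regression. Assumed, not AE2 code.
import torch
import torch.nn.functional as F

def cycle_back_loss(ego, exo, temperature=0.1):
    """ego: (T1, D), exo: (T2, D) frame embeddings of the same action."""
    ego = F.normalize(ego, dim=-1)
    exo = F.normalize(exo, dim=-1)
    # Soft nearest neighbor of each ego frame within the exo sequence.
    alpha = torch.softmax(ego @ exo.t() / temperature, dim=-1)    # (T1, T2)
    soft_nn = alpha @ exo                                          # (T1, D)
    # Cycle back: each soft neighbor should map to its original ego frame index.
    beta = torch.softmax(soft_nn @ ego.t() / temperature, dim=-1)  # (T1, T1)
    idx = torch.arange(ego.size(0), device=ego.device, dtype=torch.float)
    mu = beta @ idx                                                # expected index
    return F.mse_loss(mu, idx)                                     # regress back to i
```

Minimizing such a loss encourages frame features from the two viewpoints to be temporally corresponding, which is one way view-invariant, fine-grained action features can emerge without paired supervision.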
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.