Related papers: Exo2EgoSyn: Unlocking Foundation Video Generation Models for Exocentric-to-Egocentric Video Synthesis

URL: http://arxiv.org/abs/2511.20186v1
Date: Tue, 25 Nov 2025 11:08:37 GMT
Title: Exo2EgoSyn: Unlocking Foundation Video Generation Models for Exocentric-to-Egocentric Video Synthesis
Authors: Mohammad Mahdi, Yuqian Fu, Nedko Savov, Jiancheng Pan, Danda Pani Paudel, Luc Van Gool
Abstract summary: Exo2EgoSyn is an adaptation of WAN 2.2 that unlocks Exocentric-to-Egocentric (Exo2Ego) cross-view video synthesis. Our framework consists of three key modules.
Score: 56.456085642852976
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Foundation video generation models such as WAN 2.2 exhibit strong text- and image-conditioned synthesis abilities but remain constrained to the same-view generation setting. In this work, we introduce Exo2EgoSyn, an adaptation of WAN 2.2 that unlocks Exocentric-to-Egocentric (Exo2Ego) cross-view video synthesis. Our framework consists of three key modules. Ego-Exo View Alignment (EgoExo-Align) enforces latent-space alignment between exocentric and egocentric first-frame representations, reorienting the generative space from the given exo view toward the ego view. Multi-view Exocentric Video Conditioning (MultiExoCon) aggregates multi-view exocentric videos into a unified conditioning signal, extending WAN 2.2 beyond its vanilla single-image or text conditioning. Furthermore, Pose-Aware Latent Injection (PoseInj) injects relative exo-to-ego camera pose information into the latent state, guiding geometry-aware synthesis across viewpoints. Together, these modules enable high-fidelity ego-view video generation from third-person observations without retraining from scratch. Experiments on ExoEgo4D validate that Exo2EgoSyn significantly improves Exo2Ego synthesis, paving the way for scalable cross-view video generation with foundation models. Source code and models will be released publicly.
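The abstract does not say how the relative exo-to-ego camera pose used by PoseInj is computed or encoded. As a rough illustration only, the sketch below shows one common convention: if both cameras are given as 4x4 camera-to-world matrices, the transform that maps exo-camera coordinates to ego-camera coordinates is the inverse ego pose composed with the exo pose. The function names (`make_pose`, `invert_pose`, `relative_exo_to_ego`) and the camera-to-world convention are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def make_pose(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Build a 4x4 rigid transform from a 3x3 rotation and a 3-vector translation."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose


def invert_pose(pose: np.ndarray) -> np.ndarray:
    """Invert a rigid transform using R^T and -R^T t (no general matrix inverse needed)."""
    rotation = pose[:3, :3]
    translation = pose[:3, 3]
    inverse = np.eye(4)
    inverse[:3, :3] = rotation.T
    inverse[:3, 3] = -rotation.T @ translation
    return inverse


def relative_exo_to_ego(exo_c2w: np.ndarray, ego_c2w: np.ndarray) -> np.ndarray:
    """Transform mapping exo-camera coordinates to ego-camera coordinates.

    Assumes both inputs are camera-to-world matrices (a hypothetical convention).
    """
    return invert_pose(ego_c2w) @ exo_c2w


def rot_z(angle_rad: float) -> np.ndarray:
    """Rotation about the z-axis, used only to build example poses."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Example poses: an exo camera placed away from the scene, an ego camera near the actor.
exo_pose = make_pose(rot_z(np.pi / 2), np.array([2.0, 0.0, 1.5]))
ego_pose = make_pose(rot_z(-np.pi / 4), np.array([0.0, 0.5, 1.7]))
rel_pose = relative_exo_to_ego(exo_pose, ego_pose)

# A point expressed in exo-camera coordinates (homogeneous).
point_exo = np.array([0.3, -0.2, 1.0, 1.0])
point_ego = rel_pose @ point_exo
```

A quick consistency check is that mapping `point_exo` to world coordinates with `exo_pose` and mapping `point_ego` to world coordinates with `ego_pose` gives the same world point. How such a pose would then be embedded and injected into the latent state is specific to the paper and is not sketched here.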

Related papers

EgoX: Egocentric Video Generation from a Single Exocentric Video [46.41583107241048]
We present EgoX, a novel framework for generating egocentric videos from a single exocentric input. Our approach achieves coherent and realistic egocentric video generation while demonstrating strong scalability and robustness across unseen in-the-wild videos.
arXiv Detail & Related papers (2025-12-09T05:53:39Z)
WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation [51.1909041777449]
We present WorldWander, an in-context learning framework tailored for translating between egocentric and exocentric worlds in video generation. Experiments demonstrate that WorldWander achieves superior perspective synchronization, character consistency, and generalization.
arXiv Detail & Related papers (2025-11-27T04:40:37Z)
HumanGenesis: Agent-Based Geometric and Generative Modeling for Synthetic Human Dynamics [60.737929335600015]
We present HumanGenesis, a framework that integrates geometric and generative modeling through four collaborative agents. HumanGenesis achieves state-of-the-art performance on tasks including text-guided synthesis, video reenactment, and novel-pose generalization.
arXiv Detail & Related papers (2025-08-13T14:50:19Z)
Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations [47.04855334955006]
We propose a novel masked ego-exo modeling that promotes both causal temporal dynamics and cross-view alignment. We highlight the importance of capturing the compositional nature of human actions as a basis for robust cross-view understanding.
arXiv Detail & Related papers (2025-03-25T14:33:32Z)
NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images [50.36605863731669]
NVComposer is a novel approach that eliminates the need for explicit external alignment. NVComposer achieves state-of-the-art performance in generative multi-view NVS tasks. Our approach shows substantial improvements in synthesis quality as the number of unposed input views increases.
arXiv Detail & Related papers (2024-12-04T17:58:03Z)
Intention-driven Ego-to-Exo Video Generation [16.942040396018736]
Ego-to-exo video generation refers to generating the corresponding exocentric video from the egocentric video. This paper proposes an Intention-Driven Ego-to-exo video generation framework (IDE) that leverages action description as a view-independent representation. We conduct experiments on the relevant dataset with diverse exo-ego video pairs, and our IDE outperforms state-of-the-art models in both subjective and objective assessments.
arXiv Detail & Related papers (2024-03-14T09:07:31Z)
Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos [66.46812056962567]
Exocentric-to-egocentric cross-view translation aims to generate a first-person (egocentric) view of an actor based on a video recording that captures the actor from a third-person (exocentric) perspective. We propose a generative framework called Exo2Ego that decouples the translation process into two stages: high-level structure transformation and pixel-level hallucination.
arXiv Detail & Related papers (2024-03-11T01:00:00Z)
3D Human Pose Perception from Egocentric Stereo Videos [67.9563319914377]
We propose a new transformer-based framework to improve egocentric stereo 3D human pose estimation. Our method is able to accurately estimate human poses even in challenging scenarios, such as crouching and sitting. We will release UnrealEgo2, UnrealEgo-RW, and trained models on our project page.
arXiv Detail & Related papers (2023-12-30T21:21:54Z)
Cross-View Exocentric to Egocentric Video Synthesis [18.575642755375107]
The cross-view video synthesis task seeks to generate video sequences of one view from another, dramatically different view. We propose a novel Bi-directional Spatial Temporal Attention Fusion Generative Adversarial Network (STA-GAN) to learn both spatial and temporal information. The proposed STA-GAN consists of three parts: a temporal branch, a spatial branch, and attention fusion.
arXiv Detail & Related papers (2021-07-07T10:00:52Z)
Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos [92.38049744463149]
We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets. Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties. Our experiments show that our Ego-Exo framework can be seamlessly integrated into standard video models.
arXiv Detail & Related papers (2021-04-16T06:10:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.