EgoTwin: Dreaming Body and View in First Person
- URL: http://arxiv.org/abs/2508.13013v1
- Date: Mon, 18 Aug 2025 15:33:09 GMT
- Title: EgoTwin: Dreaming Body and View in First Person
- Authors: Jingqiao Xiu, Fangzhou Hong, Yicong Li, Mengze Li, Wentao Wang, Sirui Han, Liang Pan, Ziwei Liu,
- Abstract summary: EgoTwin is a joint video-motion generation framework built on the diffusion transformer architecture. EgoTwin anchors the human motion to the head joint and incorporates a cybernetics-inspired interaction mechanism. For comprehensive evaluation, we curate a large-scale real-world dataset of synchronized text-video-motion triplets.
- Score: 47.06226050137047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While exocentric video synthesis has achieved great progress, egocentric video generation remains largely underexplored, which requires modeling first-person view content along with camera motion patterns induced by the wearer's body movements. To bridge this gap, we introduce a novel task of joint egocentric video and human motion generation, characterized by two key challenges: 1) Viewpoint Alignment: the camera trajectory in the generated video must accurately align with the head trajectory derived from human motion; 2) Causal Interplay: the synthesized human motion must causally align with the observed visual dynamics across adjacent video frames. To address these challenges, we propose EgoTwin, a joint video-motion generation framework built on the diffusion transformer architecture. Specifically, EgoTwin introduces a head-centric motion representation that anchors the human motion to the head joint and incorporates a cybernetics-inspired interaction mechanism that explicitly captures the causal interplay between video and motion within attention operations. For comprehensive evaluation, we curate a large-scale real-world dataset of synchronized text-video-motion triplets and design novel metrics to assess video-motion consistency. Extensive experiments demonstrate the effectiveness of the EgoTwin framework.
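Illustrative sketch: the abstract describes a cybernetics-inspired interaction mechanism that captures the causal interplay between video and motion inside attention operations, but gives no implementation details. The minimal PyTorch sketch below is an assumption of how such a block-wise attention mask could look, with one token per frame per modality; the function name and token layout are hypothetical, not the paper's actual design.

```python
# Hypothetical sketch, not EgoTwin's actual mechanism: build a joint
# video-motion attention mask so that motion tokens attend to video tokens
# up to the same frame (observation -> action) and video tokens attend to
# motion tokens from strictly earlier frames (action -> next observation).
import torch


def causal_interplay_mask(num_frames: int) -> torch.Tensor:
    """Return a (2*T, 2*T) boolean mask for token layout
    [video_0..video_{T-1}, motion_0..motion_{T-1}]; True = attention allowed."""
    T = num_frames
    mask = torch.zeros(2 * T, 2 * T, dtype=torch.bool)
    frame = torch.arange(T)

    # Within-modality attention is unrestricted (video<->video, motion<->motion).
    mask[:T, :T] = True
    mask[T:, T:] = True

    # Motion token at frame t may attend to video tokens at frames <= t
    # (the wearer observes the scene before acting).
    mask[T:, :T] = frame.unsqueeze(1) >= frame.unsqueeze(0)

    # Video token at frame t may attend to motion tokens at frames < t
    # (body movement drives the next view change).
    mask[:T, T:] = frame.unsqueeze(1) > frame.unsqueeze(0)
    return mask


if __name__ == "__main__":
    print(causal_interplay_mask(4).int())
```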
Related papers
- Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures [33.2764643227486]
Egocentric interactive world models are essential for augmented reality and embodied AI, where visual generation must respond to user input with low latency, geometric consistency, and long-term stability. We study egocentric interaction generation from a single scene image under free-space hand gestures, aiming to synthesize photorealistic videos in which hands enter the scene, interact with objects, and induce plausible world dynamics under head motion. This setting introduces fundamental challenges, including distribution shift between free-space gestures and contact-heavy training data, ambiguity between hand motion and camera motion in monocular views, and the need for arbitrary-length video generation.
arXiv Detail & Related papers (2026-02-10T09:51:07Z) - EgoReAct: Egocentric Video-Driven 3D Human Reaction Generation [84.37917777533963]
We present EgoReAct, the first framework that generates 3D-aligned human reaction motions from egocentric video streams in real-time. EgoReAct achieves remarkably higher realism, spatial consistency, and generation efficiency compared with prior methods.
arXiv Detail & Related papers (2025-12-28T06:44:05Z) - EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer [64.69014756863331]
We introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion. We also propose MVS-RoPE, which offers unified 3D positional encoding for both video and motion tokens. Our findings reveal that explicitly representing human motion is complementary to appearance, significantly boosting the coherence and plausibility of human-centric video generation.
arXiv Detail & Related papers (2025-12-21T17:08:14Z) - Dexterous World Models [24.21588354488453]
Dexterous World Model (DWM) is a scene-action-conditioned video diffusion framework. We show how DWM generates temporally coherent videos depicting plausible human-scene interactions. Experiments demonstrate that DWM enables realistic and physically plausible interactions, such as grasping, opening, and moving objects.
arXiv Detail & Related papers (2025-12-19T18:59:51Z) - UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation [21.70816226149573]
We introduce Egocentric Motion Generation and Egocentric Motion Forecasting, two novel tasks that utilize first-person images for scene-aware motion synthesis. We propose UniEgoMotion, a unified conditional motion diffusion model with a novel head-centric motion representation tailored for egocentric devices. UniEgoMotion achieves state-of-the-art performance in egocentric motion reconstruction and is the first to generate motion from a single egocentric image.
arXiv Detail & Related papers (2025-08-02T00:41:20Z) - PlayerOne: Egocentric World Simulator [73.88786358213694]
PlayerOne is the first egocentric realistic world simulator. It generates egocentric videos that are strictly aligned with the real-scene human motion of the user, captured by an exocentric camera.
arXiv Detail & Related papers (2025-06-11T17:59:53Z) - VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models [71.9811050853964]
VideoJAM is a novel framework that instills an effective motion prior into video generators. VideoJAM achieves state-of-the-art performance in motion coherence. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation.
arXiv Detail & Related papers (2025-02-04T17:07:10Z) - Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481]
Co-speech gestures can achieve superior visual effects in human-machine interaction.
We present a novel motion-decoupled framework to generate co-speech gesture videos.
Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
arXiv Detail & Related papers (2024-04-02T11:40:34Z) - Intention-driven Ego-to-Exo Video Generation [16.942040396018736]
Ego-to-exo video generation refers to generating the corresponding exocentric video from an egocentric video.
This paper proposes an Intention-Driven Ego-to-exo video generation framework (IDE) that leverages action description as a view-independent representation.
We conduct experiments on the relevant dataset with diverse exo-ego video pairs, and our IDE outperforms state-of-the-art models in both subjective and objective assessments.
arXiv Detail & Related papers (2024-03-14T09:07:31Z) - EgoGen: An Egocentric Synthetic Data Generator [53.32942235801499]
EgoGen is a new synthetic data generator that can produce accurate and rich ground-truth training data for egocentric perception tasks.
At the heart of EgoGen is a novel human motion synthesis model that directly leverages egocentric visual inputs of a virtual human to sense the 3D environment.
We demonstrate EgoGen's efficacy in three tasks: mapping and localization for head-mounted cameras, egocentric camera tracking, and human mesh recovery from egocentric views.
arXiv Detail & Related papers (2024-01-16T18:55:22Z) - LEO: Generative Latent Image Animator for Human Video Synthesis [38.99490968487773]
We propose a novel framework for human video synthesis, placing emphasis on synthesizing spatio-temporal coherency.
Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolate motion from appearance.
We implement this idea via a flow-based image animator and a Latent Motion Diffusion Model (LMDM).
arXiv Detail & Related papers (2023-05-06T09:29:12Z)