EgoTwin: Dreaming Body and View in First Person
- URL: http://arxiv.org/abs/2508.13013v1
- Date: Mon, 18 Aug 2025 15:33:09 GMT
- Title: EgoTwin: Dreaming Body and View in First Person
- Authors: Jingqiao Xiu, Fangzhou Hong, Yicong Li, Mengze Li, Wentao Wang, Sirui Han, Liang Pan, Ziwei Liu,
- Abstract summary: EgoTwin is a joint video-motion generation framework built on the diffusion transformer architecture. EgoTwin anchors the human motion to the head joint and incorporates a cybernetics-inspired interaction mechanism. For comprehensive evaluation, we curate a large-scale real-world dataset of synchronized text-video-motion triplets.
- Score: 47.06226050137047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While exocentric video synthesis has achieved great progress, egocentric video generation remains largely underexplored, which requires modeling first-person view content along with camera motion patterns induced by the wearer's body movements. To bridge this gap, we introduce a novel task of joint egocentric video and human motion generation, characterized by two key challenges: 1) Viewpoint Alignment: the camera trajectory in the generated video must accurately align with the head trajectory derived from human motion; 2) Causal Interplay: the synthesized human motion must causally align with the observed visual dynamics across adjacent video frames. To address these challenges, we propose EgoTwin, a joint video-motion generation framework built on the diffusion transformer architecture. Specifically, EgoTwin introduces a head-centric motion representation that anchors the human motion to the head joint and incorporates a cybernetics-inspired interaction mechanism that explicitly captures the causal interplay between video and motion within attention operations. For comprehensive evaluation, we curate a large-scale real-world dataset of synchronized text-video-motion triplets and design novel metrics to assess video-motion consistency. Extensive experiments demonstrate the effectiveness of the EgoTwin framework.
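Illustrative sketch: the abstract describes a cybernetics-inspired interaction mechanism that captures the causal interplay between video and motion inside attention operations, but gives no implementation details. The minimal PyTorch sketch below is an assumption of how such a block-wise attention mask could look, with one token per frame per modality; the function name and token layout are hypothetical, not the paper's actual design.

```python
# Hypothetical sketch, not EgoTwin's actual mechanism: build a joint
# video-motion attention mask so that motion tokens attend to video tokens
# up to the same frame (observation -> action) and video tokens attend to
# motion tokens from strictly earlier frames (action -> next observation).
import torch


def causal_interplay_mask(num_frames: int) -> torch.Tensor:
    """Return a (2*T, 2*T) boolean mask for token layout
    [video_0..video_{T-1}, motion_0..motion_{T-1}]; True = attention allowed."""
    T = num_frames
    mask = torch.zeros(2 * T, 2 * T, dtype=torch.bool)
    frame = torch.arange(T)

    # Within-modality attention is unrestricted (video<->video, motion<->motion).
    mask[:T, :T] = True
    mask[T:, T:] = True

    # Motion token at frame t may attend to video tokens at frames <= t
    # (the wearer observes the scene before acting).
    mask[T:, :T] = frame.unsqueeze(1) >= frame.unsqueeze(0)

    # Video token at frame t may attend to motion tokens at frames < t
    # (body movement drives the next view change).
    mask[:T, T:] = frame.unsqueeze(1) > frame.unsqueeze(0)
    return mask


if __name__ == "__main__":
    print(causal_interplay_mask(4).int())
```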
Related papers
- Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures [33.2764643227486]
Egocentric interactive world models are essential for augmented reality and embodied AI, where visual generation must respond to user input with low latency, geometric consistency, and long-term stability. We study egocentric interaction generation from a single scene image under free-space hand gestures, aiming to synthesize photorealistic videos in which hands enter the scene, interact with objects, and induce plausible world dynamics under head motion. This setting introduces fundamental challenges, including distribution shift between free-space gestures and contact-heavy training data, ambiguity between hand motion and camera motion in monocular views, and the need for arbitrary-length video generation.
arXiv Detail & Related papers (2026-02-10T09:51:07Z) - EgoReAct: Egocentric Video-Driven 3D Human Reaction Generation [84.37917777533963]
We present EgoReAct, the first framework that generates 3D-aligned human reaction motions from egocentric video streams in real-time. EgoReAct achieves remarkably higher realism, spatial consistency, and generation efficiency compared with prior methods.
arXiv Detail & Related papers (2025-12-28T06:44:05Z) - EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer [64.69014756863331]
We introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion. We also propose MVS-RoPE, which offers unified 3D positional encoding for both video and motion tokens. Our findings reveal that explicitly representing human motion is complementary to appearance, significantly boosting the coherence and plausibility of human-centric video generation.
arXiv Detail & Related papers (2025-12-21T17:08:14Z) - Dexterous World Models [24.21588354488453]
Dexterous World Model (DWM) is a scene-action-conditioned video diffusion framework. We show how DWM generates temporally coherent videos depicting plausible human-scene interactions. Experiments demonstrate that DWM enables realistic and physically plausible interactions, such as grasping, opening, and moving objects.
arXiv Detail & Related papers (2025-12-19T18:59:51Z) - UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation [21.70816226149573]
We introduce Egocentric Motion Generation and Egocentric Motion Forecasting, two novel tasks that utilize first-person images for scene-aware motion synthesis. We propose UniEgoMotion, a unified conditional motion diffusion model with a novel head-centric motion representation tailored for egocentric devices. UniEgoMotion achieves state-of-the-art performance in egocentric motion reconstruction and is the first to generate motion from a single egocentric image.
arXiv Detail & Related papers (2025-08-02T00:41:20Z) - PlayerOne: Egocentric World Simulator [73.88786358213694]
PlayerOne is the first egocentric realistic world simulator. It generates egocentric videos that are strictly aligned with the real-scene human motion of the user, captured by an exocentric camera.
arXiv Detail & Related papers (2025-06-11T17:59:53Z) - VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models [71.9811050853964]
VideoJAM is a novel framework that instills an effective motion prior into video generators. VideoJAM achieves state-of-the-art performance in motion coherence. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation.
arXiv Detail & Related papers (2025-02-04T17:07:10Z) - Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481]
Co-speech gestures can achieve superior visual effects in human-machine interaction.
We present a novel motion-decoupled framework to generate co-speech gesture videos.
Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
arXiv Detail & Related papers (2024-04-02T11:40:34Z) - Intention-driven Ego-to-Exo Video Generation [16.942040396018736]
Ego-to-exo video generation refers to generating the corresponding exocentric video from an egocentric video.
This paper proposes an Intention-Driven Ego-to-exo video generation framework (IDE) that leverages action description as a view-independent representation.
We conduct experiments on the relevant dataset with diverse exo-ego video pairs, and our IDE outperforms state-of-the-art models in both subjective and objective assessments.
arXiv Detail & Related papers (2024-03-14T09:07:31Z) - EgoGen: An Egocentric Synthetic Data Generator [53.32942235801499]
EgoGen is a new synthetic data generator that can produce accurate and rich ground-truth training data for egocentric perception tasks.
At the heart of EgoGen is a novel human motion synthesis model that directly leverages egocentric visual inputs of a virtual human to sense the 3D environment.
We demonstrate EgoGen's efficacy in three tasks: mapping and localization for head-mounted cameras, egocentric camera tracking, and human mesh recovery from egocentric views.
arXiv Detail & Related papers (2024-01-16T18:55:22Z) - LEO: Generative Latent Image Animator for Human Video Synthesis [38.99490968487773]
We propose a novel framework for human video synthesis, placing emphasis on synthesizing spatio-temporal coherency.
Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolate motion from appearance.
We implement this idea via a flow-based image animator and a Latent Motion Diffusion Model (LMDM).
arXiv Detail & Related papers (2023-05-06T09:29:12Z)