Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control
- URL: http://arxiv.org/abs/2602.18422v1
- Date: Fri, 20 Feb 2026 18:45:29 GMT
- Title: Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control
- Authors: Linxi Xie, Lisong C. Sun, Ashley Neall, Tong Wu, Shengqu Cai, Gordon Wetzstein
- Abstract summary: We introduce a human-centric video world model conditioned on both tracked head pose and joint-level hand poses.
We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments.
- Score: 35.371152222595555
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand--object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as significantly higher perceived control over the performed actions compared with relevant baselines.
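The abstract describes conditioning a diffusion transformer on tracked head pose and joint-level hand poses before distilling the teacher into a causal interactive model. Below is a minimal PyTorch sketch of one common way such conditioning can be wired up, with pose embeddings injected through adaptive layer norm; the module names, dimensions, and the adaLN choice are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch (PyTorch) of conditioning a diffusion-transformer block on
# tracked head pose and joint-level hand poses. Names and sizes are assumptions.
import torch
import torch.nn as nn

class PoseConditioner(nn.Module):
    """Embeds a 6-DoF head pose plus per-joint hand poses into one vector."""
    def __init__(self, d_model=768, num_hand_joints=42, joint_dim=3):
        super().__init__()
        self.head_mlp = nn.Sequential(nn.Linear(6, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        self.hand_mlp = nn.Sequential(nn.Linear(num_hand_joints * joint_dim, d_model), nn.SiLU(), nn.Linear(d_model, d_model))

    def forward(self, head_pose, hand_joints):
        # head_pose: (B, 6) translation + rotation; hand_joints: (B, J, 3)
        return self.head_mlp(head_pose) + self.hand_mlp(hand_joints.flatten(1))

class AdaLNDiTBlock(nn.Module):
    """Transformer block whose LayerNorm scale/shift come from the pose embedding (adaLN-style)."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.ada = nn.Linear(d_model, 4 * d_model)  # predicts scale/shift for both sub-layers

    def forward(self, x, cond):
        # x: (B, T, d_model) video tokens; cond: (B, d_model) pose embedding
        s1, b1, s2, b2 = self.ada(cond).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        return x + self.mlp(h)

# usage: two samples, 16 video tokens each
cond = PoseConditioner()(torch.randn(2, 6), torch.randn(2, 42, 3))
out = AdaLNDiTBlock()(torch.randn(2, 16, 768), cond)
```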
Related papers
- Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures [33.2764643227486]
Egocentric interactive world models are essential for augmented reality and embodied AI, where visual generation must respond to user input with low latency, geometric consistency, and long-term stability.
We study egocentric interaction generation from a single scene image under free-space hand gestures, aiming to synthesize photorealistic videos in which hands enter the scene, interact with objects, and induce plausible world dynamics under head motion.
This setting introduces fundamental challenges, including distribution shift between free-space gestures and contact-heavy training data, ambiguity between hand motion and camera motion in monocular views, and the need for arbitrary-length video generation.
arXiv Detail & Related papers (2026-02-10T09:51:07Z)
- Causal World Modeling for Robot Control [56.31803788587547]
Video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics.
We introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously.
We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations.
arXiv Detail & Related papers (2026-01-29T17:07:43Z)
- EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer [64.69014756863331]
We introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion.
We also propose MVS-RoPE, which offers unified 3D positional encoding for both video and motion tokens.
Our findings reveal that explicitly representing human motion alongside appearance significantly boosts the coherence and plausibility of human-centric video generation.
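The summary mentions MVS-RoPE, a unified 3D positional encoding shared by video and motion tokens. The sketch below shows a generic 3D rotary embedding in which feature channels are split across (t, x, y) axes and video-patch and joint tokens share one coordinate grid; this is an assumed, simplified variant for illustration, not EchoMotion's actual MVS-RoPE formulation.

```python
# Hedged sketch of a 3D rotary position embedding shared by video and motion tokens.
import torch

def rope_1d(pos, dim, base=10000.0):
    # pos: (N,) positions along one axis -> per-channel rotation angles
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    ang = pos[:, None] * freqs[None, :]               # (N, dim/2)
    return torch.cos(ang), torch.sin(ang)

def apply_rope_3d(q, coords, dims=(32, 16, 16)):
    """Rotate features by per-axis angles.
    q: (N, D) with D == sum(dims); coords: (N, 3) integer (t, x, y) per token."""
    out, start = [], 0
    for axis, d in enumerate(dims):
        cos, sin = rope_1d(coords[:, axis].float(), d)
        x = q[:, start:start + d].reshape(-1, d // 2, 2)
        x1, x2 = x[..., 0], x[..., 1]
        rot = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
        out.append(rot.reshape(-1, d))
        start += d
    return torch.cat(out, dim=-1)

# Video patch tokens and motion (joint) tokens can share the same (t, x, y) grid,
# so attention sees both modalities in one consistent positional frame.
video_coords = torch.tensor([[0, 3, 5], [1, 3, 5]])    # (frame, patch_x, patch_y)
motion_coords = torch.tensor([[0, 0, 0], [1, 0, 0]])   # joint tokens pinned to frame index
q = torch.randn(4, 64)
q_rot = apply_rope_3d(q, torch.cat([video_coords, motion_coords], dim=0))
```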
arXiv Detail & Related papers (2025-12-21T17:08:14Z)
- Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation [18.328135509017944]
We propose a structure and contact-aware representation that captures hand-object contact, hand-object occlusion, and holistic structure context without 3D annotations.
This interaction-oriented and scalable supervision signal enables the model to learn fine-grained interaction physics and generalize to open-world scenarios.
Our method outperforms state-of-the-art methods on two real-world datasets in generating physics-realistic and temporally coherent HOI videos.
arXiv Detail & Related papers (2025-12-01T13:44:31Z)
- Towards High-Consistency Embodied World Model with Multi-View Trajectory Videos [24.111891848073288]
Embodied world models aim to predict and interact with the physical world through visual observations and actions.
MTV-World introduces Multi-view Trajectory-Video control for precise visuomotor prediction.
MTV-World achieves precise control execution and accurate physical interaction modeling in complex dual-arm scenarios.
arXiv Detail & Related papers (2025-11-17T02:17:04Z)
- Learning to Generate Object Interactions with Physics-Guided Video Diffusion [28.191514920144456]
We introduce KineMask, an approach for physics-guided video generation that enables realistic rigid body control, interactions, and effects.
We propose a two-stage training strategy that gradually removes future motion supervision via object masks.
Experiments show that KineMask achieves strong improvements over recent models of comparable size.
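KineMask's summary describes a two-stage training strategy that gradually removes future motion supervision via object masks. A minimal sketch of such a schedule follows, assuming a linear decay of the probability of providing future object masks in the second stage; the step counts and the drop_future_masks helper are hypothetical, not the paper's exact recipe.

```python
# Hedged sketch: schedule that gradually withdraws future object-mask supervision.
import torch

def mask_keep_prob(step, stage1_steps=50_000, stage2_steps=50_000):
    """Stage 1: always condition on future object masks.
    Stage 2: linearly decay the probability of providing them to zero."""
    if step < stage1_steps:
        return 1.0
    frac = (step - stage1_steps) / max(stage2_steps, 1)
    return max(0.0, 1.0 - frac)

def drop_future_masks(masks, step):
    # masks: (B, T, 1, H, W) binary object masks for future frames
    keep = torch.rand(masks.shape[0]) < mask_keep_prob(step)
    return masks * keep.view(-1, 1, 1, 1, 1).to(masks.dtype)

masks = torch.randint(0, 2, (4, 8, 1, 32, 32)).float()
conditioned = drop_future_masks(masks, step=75_000)  # roughly half the batch keeps its masks
```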
arXiv Detail & Related papers (2025-10-02T17:56:46Z)
- MoReact: Generating Reactive Motion from Textual Descriptions [57.642436102978245]
MoReact is a diffusion-based method designed to disentangle the generation of global trajectories and local motions sequentially.
Our experiments, utilizing data adapted from a two-person motion dataset, demonstrate the efficacy of our approach.
arXiv Detail & Related papers (2025-09-28T14:31:41Z)
- Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation.
Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction.
Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
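RoboMaster's summary splits the interaction process into pre-interaction, interaction, and post-interaction sub-stages. The sketch below labels a manipulation trajectory with these three phases using a simple distance-to-object threshold as a contact proxy; this heuristic is an assumption for illustration, not the paper's actual segmentation method.

```python
# Hedged sketch: label trajectory frames as pre-interaction / interaction / post-interaction.
import numpy as np

def segment_interaction(gripper_xyz, object_xyz, contact_dist=0.03):
    """gripper_xyz, object_xyz: (T, 3) trajectories in meters.
    Returns a per-frame label: 0 = pre, 1 = interaction, 2 = post."""
    dist = np.linalg.norm(gripper_xyz - object_xyz, axis=1)
    in_contact = dist < contact_dist
    labels = np.zeros(len(dist), dtype=np.int64)
    if in_contact.any():
        first, last = np.flatnonzero(in_contact)[[0, -1]]
        labels[first:last + 1] = 1   # interaction phase
        labels[last + 1:] = 2        # post-interaction
    return labels

T = 60
gripper = np.linspace([0.3, 0.0, 0.2], [0.0, 0.0, 0.0], T)  # gripper approaching the object
obj = np.zeros((T, 3))                                      # static object at the origin
print(segment_interaction(gripper, obj))
```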
arXiv Detail & Related papers (2025-06-02T17:57:06Z)
- Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators.
To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module.
Experiments demonstrate that DWS can be versatilely applied to both diffusion and autoregressive transformer models.
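The DWS summary introduces a lightweight, universal action-conditioned module attached to a pre-trained video generative model. Below is a minimal sketch of such an adapter, assuming per-frame action vectors are projected into the backbone's hidden space and added through a zero-initialized gate; the placement, sizes, and gating are illustrative assumptions rather than the paper's design.

```python
# Hedged sketch: lightweight action-conditioning adapter for a frozen video backbone.
import torch
import torch.nn as nn

class ActionAdapter(nn.Module):
    def __init__(self, action_dim=7, d_model=1024):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(action_dim, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init so the frozen model is unchanged at the start

    def forward(self, hidden, actions):
        # hidden: (B, T, N, d_model) per-frame token features from the frozen backbone
        # actions: (B, T, action_dim) control signal aligned to frames
        a = self.proj(actions).unsqueeze(2)        # broadcast over the N tokens of each frame
        return hidden + torch.tanh(self.gate) * a

adapter = ActionAdapter()
out = adapter(torch.randn(2, 8, 256, 1024), torch.randn(2, 8, 7))
```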
arXiv Detail & Related papers (2025-02-10T14:49:09Z)
- EgoGen: An Egocentric Synthetic Data Generator [53.32942235801499]
EgoGen is a new synthetic data generator that can produce accurate and rich ground-truth training data for egocentric perception tasks.
At the heart of EgoGen is a novel human motion synthesis model that directly leverages egocentric visual inputs of a virtual human to sense the 3D environment.
We demonstrate EgoGen's efficacy in three tasks: mapping and localization for head-mounted cameras, egocentric camera tracking, and human mesh recovery from egocentric views.
arXiv Detail & Related papers (2024-01-16T18:55:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.