Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures
- URL: http://arxiv.org/abs/2602.09600v2
- Date: Fri, 13 Feb 2026 08:39:01 GMT
- Title: Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures
- Authors: Yuxi Wang, Wenqi Ouyang, Tianyi Wei, Yi Dong, Zhiqi Shen, Xingang Pan
- Abstract summary: Egocentric interactive world models are essential for augmented reality and embodied AI, where visual generation must respond to user input with low latency, geometric consistency, and long-term stability. We study egocentric interaction generation from a single scene image under free-space hand gestures, aiming to synthesize photorealistic videos in which hands enter the scene, interact with objects, and induce plausible world dynamics under head motion. This setting introduces fundamental challenges, including distribution shift between free-space gestures and contact-heavy training data, ambiguity between hand motion and camera motion in monocular views, and the need for arbitrary-length video generation.
- Score: 33.2764643227486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Egocentric interactive world models are essential for augmented reality and embodied AI, where visual generation must respond to user input with low latency, geometric consistency, and long-term stability. We study egocentric interaction generation from a single scene image under free-space hand gestures, aiming to synthesize photorealistic videos in which hands enter the scene, interact with objects, and induce plausible world dynamics under head motion. This setting introduces fundamental challenges, including distribution shift between free-space gestures and contact-heavy training data, ambiguity between hand motion and camera motion in monocular views, and the need for arbitrary-length video generation. We present Hand2World, a unified autoregressive framework that addresses these challenges through occlusion-invariant hand conditioning based on projected 3D hand meshes, allowing visibility and occlusion to be inferred from scene context rather than encoded in the control signal. To stabilize egocentric viewpoint changes, we inject explicit camera geometry via per-pixel Plücker-ray embeddings, disentangling camera motion from hand motion and preventing background drift. We further develop a fully automated monocular annotation pipeline and distill a bidirectional diffusion model into a causal generator, enabling arbitrary-length synthesis. Experiments on three egocentric interaction benchmarks show substantial improvements in perceptual quality and 3D consistency while supporting camera control and long-horizon interactive generation.
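The two geometric conditioning signals named in the abstract are concrete enough to sketch. First, the per-pixel Plücker-ray embedding: a camera ray through pixel (u, v) can be encoded by its unit direction d and moment m = o × d, where o is the camera center; the pair (d, m) identifies the ray independently of which point on it is chosen, giving a 6-channel map that injects explicit camera geometry. Below is a minimal sketch, assuming a pinhole camera with intrinsics K and a camera-to-world pose (R, t); the function and variable names are illustrative, not taken from the paper's code.

```python
# Minimal sketch of per-pixel Pluecker-ray embeddings (assumed interface;
# not the authors' implementation). K: 3x3 intrinsics; R, t: camera-to-world
# rotation and translation, so t is the camera center in world coordinates.
import numpy as np

def plucker_ray_embedding(K, R, t, height, width):
    """Return a (height, width, 6) map of Pluecker coordinates (d, o x d)."""
    # Homogeneous pixel coordinates (u, v, 1) for every pixel.
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)

    # Back-project pixels to camera-space rays, rotate into world space,
    # and normalize to unit directions.
    dirs = (pix @ np.linalg.inv(K).T) @ R.T
    d = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

    # The moment m = o x d is invariant to the point chosen on the ray, so
    # the pair (d, m) identifies the ray itself rather than a 3D point.
    m = np.cross(t, d)
    return np.concatenate([d, m], axis=-1)
```

Second, the occlusion-invariant hand conditioning: the abstract describes projecting 3D hand meshes into the image without encoding visibility, so the generator infers occlusion from scene context. A simplified sketch, assuming MANO-style vertices already expressed in the camera frame and point splatting in place of proper triangle rasterization:

```python
import numpy as np

def project_hand_mesh(vertices_cam, K, height, width):
    """Splat 3D hand vertices into a 2D occupancy map. No visibility test is
    applied: occluded vertices are drawn too, matching the occlusion-invariant
    control signal described in the abstract."""
    z = np.clip(vertices_cam[:, 2:3], 1e-6, None)   # guard against z <= 0
    uv = (vertices_cam @ K.T)[:, :2] / z            # perspective projection
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, width - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, height - 1)
    mask = np.zeros((height, width), dtype=np.float32)
    mask[v, u] = 1.0
    return mask
```

In practice the 6-channel ray map would presumably be resized and concatenated with the model's latent features, and the hand map rendered per frame as the control sequence; both integration details are assumptions, since the abstract does not specify them.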
Related papers
- Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control [35.371152222595555]
We introduce a human-centric video world model conditioned on both tracked head pose and joint-level hand poses. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments.
arXiv Detail & Related papers (2026-02-20T18:45:29Z)
- 3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation [29.389246008057473]
2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. 3DiMo trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens. Experiments confirm that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control.
arXiv Detail & Related papers (2026-02-03T17:59:09Z)
- EgoReAct: Egocentric Video-Driven 3D Human Reaction Generation [84.37917777533963]
We present EgoReAct, the first framework that generates 3D-aligned human reaction motions from egocentric video streams in real time. EgoReAct achieves remarkably higher realism, spatial consistency, and generation efficiency compared with prior methods.
arXiv Detail & Related papers (2025-12-28T06:44:05Z)
- Dexterous World Models [24.21588354488453]
Dexterous World Model (DWM) is a scene-action-conditioned video diffusion framework. We show how DWM generates temporally coherent videos depicting plausible human-scene interactions. Experiments demonstrate that DWM enables realistic and physically plausible interactions, such as grasping, opening, and moving objects.
arXiv Detail & Related papers (2025-12-19T18:59:51Z)
- EgoTwin: Dreaming Body and View in First Person [47.06226050137047]
EgoTwin is a joint video-motion generation framework built on the diffusion transformer architecture. EgoTwin anchors the human motion to the head joint and incorporates a cybernetics-inspired interaction mechanism. For comprehensive evaluation, we curate a large-scale real-world dataset of synchronized text-video-motion triplets.
arXiv Detail & Related papers (2025-08-18T15:33:09Z)
- PlayerOne: Egocentric World Simulator [73.88786358213694]
PlayerOne is the first egocentric realistic world simulator. It generates egocentric videos that are strictly aligned with the user's real-scene human motion, as captured by an exocentric camera.
arXiv Detail & Related papers (2025-06-11T17:59:53Z)
- SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios [48.09735396455107]
Hand-Object Interaction (HOI) generation has significant application potential. Current 3D HOI motion generation approaches heavily rely on predefined 3D object models and lab-captured motion data. We propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate the HOI video and motion simultaneously.
arXiv Detail & Related papers (2025-06-03T05:04:29Z)
- HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene Perception [57.37135310143126]
HOSIG is a novel framework for synthesizing full-body interactions through hierarchical scene perception. Our framework supports unlimited motion length through autoregressive generation and requires minimal manual intervention. This work bridges the critical gap between scene-aware navigation and dexterous object manipulation.
arXiv Detail & Related papers (2025-06-02T12:08:08Z)
- Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects [89.95728475983263]
Holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition, and motion generation.
We design the HANDS23 challenge based on the AssemblyHands and ARCTIC datasets with carefully designed training and testing splits.
Based on the results of the top submitted methods and more recent baselines on the leaderboards, we perform a thorough analysis of 3D hand(-object) reconstruction tasks.
arXiv Detail & Related papers (2024-03-25T05:12:21Z)
- Decoupling Dynamic Monocular Videos for Dynamic View Synthesis [50.93409250217699]
We tackle the challenge of dynamic view synthesis from dynamic monocular videos in an unsupervised fashion.
Specifically, we decouple the motion of the dynamic objects into object motion and camera motion, regularized respectively by the proposed unsupervised surface-consistency and patch-based multi-view constraints.
arXiv Detail & Related papers (2023-04-04T11:25:44Z)