AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation
- URL: http://arxiv.org/abs/2602.04672v1
- Date: Wed, 04 Feb 2026 15:42:58 GMT
- Title: AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation
- Authors: Jin-Chuan Shi, Binhong Ye, Tao Liu, Junzhe He, Yangjinhui Xu, Xiaoyang Liu, Zeju Li, Hao Chen, Chunhua Shen
- Abstract summary: We introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. We show that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior art frequently collapses.
- Score: 45.753757870577196
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reconstructing dynamic hand-object interactions from monocular videos is critical for dexterous manipulation data collection and creating realistic digital twins for robotics and VR. However, current methods face two prohibitive barriers: (1) reliance on neural rendering often yields fragmented, non-simulation-ready geometries under heavy occlusion, and (2) dependence on brittle Structure-from-Motion (SfM) initialization leads to frequent failures on in-the-wild footage. To overcome these limitations, we introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. First, we employ an agentic pipeline where a Vision-Language Model (VLM) guides a generative model to synthesize a complete, watertight object mesh with high-fidelity texture, independent of video occlusions. Second, bypassing fragile SfM entirely, we propose a robust anchor-and-track strategy. We initialize the object pose at a single interaction onset frame using a foundation model and propagate it temporally by leveraging the strong visual similarity between our generated asset and video observations. Finally, a contact-aware optimization integrates semantic, geometric, and interaction stability constraints to enforce physical plausibility. Extensive experiments on HO3D, DexYCB, and in-the-wild videos reveal that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior art frequently collapses. By prioritizing physical validity, our method produces simulation-ready assets validated via real-to-sim retargeting for robotic applications.
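The abstract's final stage folds semantic, geometric, and interaction stability constraints into a single optimization over the object pose. The exact formulation is not given here, so the following is a minimal PyTorch sketch of what such a contact-aware pose refinement could look like: the chamfer-based geometric term, the point-cloud contact proxy, the temporal smoothness term, and all loss weights are illustrative assumptions rather than AGILE's published method (a semantic term is omitted, since it would need a differentiable renderer).

```python
# Hedged sketch of contact-aware pose refinement; all terms and weights
# are illustrative assumptions, not AGILE's published implementation.
import torch

def axis_angle_to_matrix(aa: torch.Tensor) -> torch.Tensor:
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = aa.norm() + 1e-8
    k = aa / theta
    zero = torch.zeros((), dtype=aa.dtype)
    K = torch.stack([
        torch.stack([zero, -k[2], k[1]]),
        torch.stack([k[2], zero, -k[0]]),
        torch.stack([-k[1], k[0], zero]),
    ])
    return torch.eye(3) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

def chamfer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """One-sided chamfer distance from point set a (N, 3) to b (M, 3)."""
    return torch.cdist(a, b).min(dim=1).values.mean()

def refine_pose(obj_pts, hand_pts, obs_pts, prev_R, prev_t, steps=200):
    """Refine one frame's object pose against observed points while keeping
    the object in contact with the hand and near the previous frame's pose."""
    aa = torch.zeros(3, requires_grad=True)           # incremental rotation
    t = prev_t.clone().detach().requires_grad_(True)  # translation
    opt = torch.optim.Adam([aa, t], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        R = axis_angle_to_matrix(aa) @ prev_R
        pts = obj_pts @ R.T + t                 # posed points of generated asset
        loss_geom = chamfer(pts, obs_pts)       # geometric fit to observation
        # Contact proxy: pull the closest object point onto the hand point
        # cloud. A true penetration term would need a signed distance field.
        loss_contact = torch.cdist(pts, hand_pts).min()
        loss_smooth = (t - prev_t).pow(2).sum()  # interaction stability over time
        loss = loss_geom + 0.1 * loss_contact + 0.01 * loss_smooth
        loss.backward()
        opt.step()
    return (axis_angle_to_matrix(aa) @ prev_R).detach(), t.detach()

# Synthetic usage: a shifted copy of the asset stands in for the observation.
obj = torch.randn(500, 3) * 0.05                 # generated asset (canonical)
hand = torch.randn(200, 3) * 0.02 + 0.08         # hand surface points
obs = obj + torch.tensor([0.02, 0.0, 0.01])      # "observed" object points
R, t = refine_pose(obj, hand, obs, torch.eye(3), torch.zeros(3))
```

In a full pipeline, prev_R and prev_t would come from the anchor-and-track propagation described above, with each frame's refined pose seeding the next.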
Related papers
- MeshMimic: Geometry-Aware Humanoid Motion Learning through 3D Scene Reconstruction [54.36564144414704]
MeshMimic is an innovative framework that bridges 3D scene reconstruction and embodied intelligence to enable humanoid robots to learn coupled "motion-terrain" interactions directly from video. By leveraging state-of-the-art 3D vision models, our framework precisely segments and reconstructs both human trajectories and the underlying 3D geometry of terrains and objects.
arXiv Detail & Related papers (2026-02-17T17:09:45Z)
- Masked Modeling for Human Motion Recovery Under Occlusions [21.05382087890133]
MoRo is an end-to-end generative framework that formulates motion reconstruction as a video-conditioned task. MoRo achieves real-time inference at 70 FPS on a single H200 GPU.
arXiv Detail & Related papers (2026-01-22T16:22:20Z)
- ByteLoom: Weaving Geometry-Consistent Human-Object Interactions through Progressive Curriculum Learning [19.292101162897975]
We introduce ByteLoom, a framework that generates realistic HOI videos with geometrically consistent object appearance. We first propose an RCM-cache mechanism that leverages Relative Coordinate Maps (RCM) as a universal representation to maintain the object's geometric consistency. We then design a training curriculum that enhances model capabilities progressively and relaxes the demand for hand meshes.
arXiv Detail & Related papers (2025-12-28T09:38:36Z)
- SpriteHand: Real-Time Versatile Hand-Object Interaction with Autoregressive Video Generation [64.3409486422946]
We present SpriteHand, an autoregressive video generation framework for real-time synthesis of hand-object interaction videos. Our model employs a causal inference architecture for autoregressive generation and leverages a hybrid post-training approach to enhance visual realism and temporal coherence. Experiments demonstrate superior visual quality, physical plausibility, and interaction fidelity compared to both generative and engine-based baselines.
arXiv Detail & Related papers (2025-12-01T18:13:40Z)
- PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection [10.498184571108995]
We propose PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation. Specifically, we introduce PhysicsRM, the first dual-dimensional reward model that quantifies both intra-object stability and inter-object interactions. Our approach is model-agnostic and scalable, enabling seamless integration into a wide range of video diffusion and transformer-based backbones.
arXiv Detail & Related papers (2025-11-06T02:40:57Z)
- Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects [59.51185639557874]
We introduce Kinematify, an automated framework that synthesizes articulated objects directly from arbitrary RGB images or textual descriptions. Our method addresses two core challenges: (i) inferring kinematic topologies for high-DoF objects and (ii) estimating joint parameters from static geometry.
arXiv Detail & Related papers (2025-11-03T07:21:42Z)
- Puppeteer: Rig and Animate Your 3D Models [105.11046762553121]
Puppeteer is a comprehensive framework that addresses both automatic rigging and animation for diverse 3D objects. Our system first predicts plausible skeletal structures via an auto-regressive transformer. It then infers skinning weights via an attention-based architecture.
arXiv Detail & Related papers (2025-08-14T17:59:31Z)
- SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios [48.09735396455107]
Hand-Object Interaction (HOI) generation has significant application potential. Current 3D HOI motion generation approaches heavily rely on predefined 3D object models and lab-captured motion data. We propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate the HOI video and motion simultaneously.
arXiv Detail & Related papers (2025-06-03T05:04:29Z)
- Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control [72.00655365269]
We present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
arXiv Detail & Related papers (2025-06-02T17:57:06Z)
- MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation [28.75149480374178]
MEgoHand is a framework that synthesizes physically plausible hand-object interactions from egocentric RGB, text, and an initial hand pose. It achieves substantial reductions in wrist translation error and joint rotation error, highlighting its capacity to accurately model fine-grained hand joint structures.
arXiv Detail & Related papers (2025-05-22T12:37:47Z)