Efficient and Scalable Monocular Human-Object Interaction Motion Reconstruction
- URL: http://arxiv.org/abs/2512.00960v2
- Date: Sat, 06 Dec 2025 08:46:30 GMT
- Title: Efficient and Scalable Monocular Human-Object Interaction Motion Reconstruction
- Authors: Boran Wen, Ye Lu, Keyan Wan, Sirui Wang, Jiahong Zhou, Junxuan Liang, Xinpeng Liu, Bang Xiao, Dingbang Huang, Ruiyang Liu, Yong-Lu Li
- Abstract summary: Generalized robots must learn from diverse, large-scale human-object interactions (HOI) to robustly operate in the real world. We introduce 4DHOISolver, a novel and efficient optimization framework that constrains the ill-posed 4D HOI reconstruction problem. We also introduce Open4DHOI, a new large-scale 4D HOI dataset featuring a diverse catalog of 144 object types and 103 actions.
- Score: 19.16200327159635
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generalized robots must learn from diverse, large-scale human-object interactions (HOI) to operate robustly in the real world. Monocular internet videos offer a nearly limitless and readily available source of data, capturing an unparalleled diversity of human activities, objects, and environments. However, accurately and scalably extracting 4D interaction data from these in-the-wild videos remains a significant and unsolved challenge. Thus, in this work, we introduce 4DHOISolver, a novel and efficient optimization framework that constrains the ill-posed 4D HOI reconstruction problem by leveraging sparse, human-in-the-loop contact point annotations, while maintaining high spatio-temporal coherence and physical plausibility. Leveraging this framework, we introduce Open4DHOI, a new large-scale 4D HOI dataset featuring a diverse catalog of 144 object types and 103 actions. Furthermore, we demonstrate the effectiveness of our reconstructions by enabling an RL-based agent to imitate the recovered motions. However, a comprehensive benchmark of existing 3D foundation models indicates that automatically predicting precise human-object contact correspondences remains an unsolved problem, underscoring the immediate necessity of our human-in-the-loop strategy while posing an open challenge to the community. Data and code will be publicly available at https://wenboran2002.github.io/open4dhoi/
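The abstract's core idea, constraining an otherwise ill-posed trajectory fit with sparse contact annotations plus a temporal-coherence prior, can be illustrated with a toy sketch. This is a simplification of ours, not the paper's actual 4DHOISolver: the object state is reduced to a per-frame translation, the solver is plain gradient descent, and all function and parameter names are hypothetical.

```python
import numpy as np

def contact_energy(obj_translations, human_points, contact_offset, lam=1.0):
    """Toy energy for contact-constrained object trajectory fitting.

    obj_translations : (T, 3) per-frame object translations (the unknowns)
    human_points     : (T, 3) annotated human contact points per frame
    contact_offset   : (3,)   contact point in the object's local frame
    A data term pulls the object's contact point onto the human contact
    annotation; a smoothness term keeps the trajectory temporally coherent.
    """
    contact = obj_translations + contact_offset              # world-space contact
    data = np.sum((contact - human_points) ** 2)             # contact term
    smooth = np.sum(np.diff(obj_translations, axis=0) ** 2)  # temporal term
    return data + lam * smooth

def fit(human_points, contact_offset, steps=500, lr=0.05, lam=1.0):
    """Fit per-frame translations by gradient descent on the quadratic energy."""
    T = human_points.shape[0]
    x = np.zeros((T, 3))
    for _ in range(steps):
        g = 2.0 * (x + contact_offset - human_points)  # grad of data term
        d = np.diff(x, axis=0)                         # frame-to-frame deltas
        g[:-1] -= 2.0 * lam * d                        # grad of smooth term
        g[1:] += 2.0 * lam * d
        x -= lr * g
    return x
```

With a small smoothness weight, the recovered contact points land close to the annotations while the trajectory stays smooth.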
Related papers
- ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors [51.06020148149403]
We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded.
arXiv Detail & Related papers (2026-03-04T17:58:04Z)
- MeshMimic: Geometry-Aware Humanoid Motion Learning through 3D Scene Reconstruction [54.36564144414704]
MeshMimic is an innovative framework that bridges 3D scene reconstruction and embodied intelligence to enable humanoid robots to learn coupled "motion-terrain" interactions directly from video. By leveraging state-of-the-art 3D vision models, our framework precisely segments and reconstructs both human trajectories and the underlying 3D geometry of terrains and objects.
arXiv Detail & Related papers (2026-02-17T17:09:45Z)
- CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction [40.557276644446475]
We present CARI4D, the first category-agnostic method that reconstructs spatially and temporally consistent 4D human-object interaction at metric scale from monocular RGB videos. Our model generalizes beyond the training categories and thus can be applied zero-shot to in-the-wild internet videos.
arXiv Detail & Related papers (2025-12-12T19:11:11Z)
- OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction [76.44108003274955]
A dominant paradigm for teaching humanoid robots complex skills is to retarget human motions as kinematic references to train reinforcement learning policies. We introduce OmniRetarget, an interaction-preserving data generation engine based on an interaction mesh. By minimizing the Laplacian deformation between the human and robot meshes, OmniRetarget generates kinematically feasible trajectories.
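The "Laplacian deformation" objective mentioned above can be sketched in a few lines. This is a generic uniform-Laplacian version for illustration only; OmniRetarget's actual interaction-mesh formulation and weighting may differ, and the function names here are ours.

```python
import numpy as np

def laplacian_coords(verts, neighbors):
    """Uniform Laplacian coordinate of each vertex: the vertex position
    minus the mean of its neighbors. `neighbors` maps a vertex index to
    a list of neighbor indices in the (interaction) mesh graph."""
    delta = np.zeros_like(verts)
    for i, nbrs in neighbors.items():
        delta[i] = verts[i] - verts[list(nbrs)].mean(axis=0)
    return delta

def laplacian_deformation_energy(src_verts, tgt_verts, neighbors):
    """Sum of squared differences between the Laplacian coordinates of
    the source (human) mesh and the retargeted (robot) mesh. Rigid
    translations leave the energy at zero; local shape distortion,
    e.g. scaling, increases it."""
    d_src = laplacian_coords(src_verts, neighbors)
    d_tgt = laplacian_coords(tgt_verts, neighbors)
    return float(np.sum((d_src - d_tgt) ** 2))
```

Minimizing this energy over the robot vertices preserves the local spatial arrangement of the interaction while allowing global repositioning.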
arXiv Detail & Related papers (2025-09-30T17:59:02Z)
- InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation [54.09384502044162]
We introduce InterAct, a large-scale 3D HOI benchmark featuring dataset and methodological advancements. First, we consolidate and standardize 21.81 hours of HOI data from diverse sources, enriching it with detailed textual annotations. Second, we propose a unified optimization framework to enhance data quality by reducing artifacts and correcting hand motions. Third, we define six benchmarking tasks and develop a unified HOI generative modeling perspective, achieving state-of-the-art performance.
arXiv Detail & Related papers (2025-09-11T15:43:54Z)
- GenHOI: Generalizing Text-driven 4D Human-Object Interaction Synthesis for Unseen Objects [13.830968058014546]
GenHOI is a two-stage framework aimed at achieving two key objectives: 1) generalization to unseen objects and 2) the synthesis of high-fidelity 4D HOI sequences. We introduce a Contact-Aware Diffusion Model (ContactDM) in the second stage to seamlessly interpolate 3D HOIs into densely temporally coherent 4D HOI sequences. Experimental results show that we achieve state-of-the-art results on the publicly available OMOMODM and 3D-FUTURE datasets.
arXiv Detail & Related papers (2025-06-18T14:17:53Z)
- CORE4D: A 4D Human-Object-Human Interaction Dataset for Collaborative Object REarrangement [24.287902864042792]
We present CORE4D, a novel large-scale 4D human-object collaboration dataset. With 1K human-object-human motion sequences captured in the real world, we enrich CORE4D by contributing an iterative collaboration strategy to augment motions to a variety of novel objects. Benefiting from extensive motion patterns provided by CORE4D, we benchmark two tasks aiming at generating human-object interaction: human-object motion forecasting and interaction synthesis.
arXiv Detail & Related papers (2024-06-27T17:32:18Z)
- H4D: Human 4D Modeling by Learning Neural Compositional Representation [75.34798886466311]
This work presents a novel framework that can effectively learn a compact and compositional representation for dynamic humans.
A simple yet effective linear motion model is proposed to provide a rough and regularized motion estimation.
Experiments demonstrate that our method is not only effective in recovering dynamic humans with accurate motion and detailed geometry, but also amenable to various 4D human-related tasks.
arXiv Detail & Related papers (2022-03-02T17:10:49Z)
- RobustFusion: Robust Volumetric Performance Reconstruction under Human-object Interactions from Monocular RGBD Stream [27.600873320989276]
High-quality 4D reconstruction of human performance with complex interactions to various objects is essential in real-world scenarios.
Recent advances still fail to provide reliable performance reconstruction.
We propose RobustFusion, a robust volumetric performance reconstruction system for human-object interaction scenarios.
arXiv Detail & Related papers (2021-04-30T08:41:45Z)
- Reconstructing Hand-Object Interactions in the Wild [71.16013096764046]
We propose an optimization-based procedure which does not require direct 3D supervision.
We exploit all available related data (2D bounding boxes, 2D hand keypoints, 2D instance masks, 3D object models, 3D in-the-lab MoCap) to provide constraints for the 3D reconstruction.
Our method produces compelling reconstructions on the challenging in-the-wild data from the EPIC Kitchens and the 100 Days of Hands datasets.
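The multi-cue fitting described above, turning 2D evidence into constraints on a 3D reconstruction, rests on reprojection terms of the following kind. This is a generic illustration with made-up intrinsics and function names, not the paper's actual objective, which combines several further cues (masks, boxes, MoCap priors).

```python
import numpy as np

def project(points3d, f=1000.0, cx=320.0, cy=240.0):
    """Pinhole projection of camera-frame 3D points to pixel coordinates.
    Intrinsics (f, cx, cy) are arbitrary example values."""
    z = points3d[:, 2:3]
    return np.hstack([f * points3d[:, :1] / z + cx,
                      f * points3d[:, 1:2] / z + cy])

def reprojection_residual(points3d, keypoints2d, conf):
    """Confidence-weighted 2D keypoint residual: one term of a multi-cue
    fitting objective. A zero residual means the 3D estimate reprojects
    exactly onto the detected 2D keypoints."""
    return conf[:, None] * (project(points3d) - keypoints2d)
```

Stacking such residuals for keypoints, mask boundaries, and box constraints yields a least-squares problem solvable with any standard optimizer.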
arXiv Detail & Related papers (2020-12-17T18:59:58Z)
- HMOR: Hierarchical Multi-Person Ordinal Relations for Monocular Multi-Person 3D Pose Estimation [54.23770284299979]
This paper introduces a novel form of supervision: Hierarchical Multi-person Ordinal Relations (HMOR).
HMOR encodes interaction information as the ordinal relations of depths and angles hierarchically.
An integrated top-down model is designed to leverage these ordinal relations in the learning process.
The proposed method significantly outperforms state-of-the-art methods on publicly available multi-person 3D pose datasets.
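An ordinal-relation term of the kind HMOR describes, supervising relative rather than absolute depths, could look like the following softplus ranking loss. This is our own sketch; the paper's exact loss, margins, and hierarchy over joints, persons, and body parts may differ.

```python
import numpy as np

def ordinal_depth_loss(depths, pairs):
    """Pairwise ordinal-depth loss: for each (i, j) with person i
    annotated as closer than person j, a softplus penalty that is small
    when z_i < z_j and grows when the predicted order is violated."""
    loss = 0.0
    for i, j in pairs:
        loss += np.log1p(np.exp(depths[i] - depths[j]))
    return loss / max(len(pairs), 1)
```

Because only the ordering is supervised, such a term can be derived from cheap relative annotations instead of metric ground-truth depth.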
arXiv Detail & Related papers (2020-08-01T07:53:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.