CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction
- URL: http://arxiv.org/abs/2512.11988v1
- Date: Fri, 12 Dec 2025 19:11:11 GMT
- Title: CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction
- Authors: Xianghui Xie, Bowen Wen, Yan Chang, Hesam Rabeti, Jiefeng Li, Ye Yuan, Gerard Pons-Moll, Stan Birchfield,
- Abstract summary: We present CARI4D, the first category-agnostic method that reconstructs spatially and temporally consistent 4D human-object interaction at metric scale from monocular RGB videos. Our model generalizes beyond the training categories and thus can be applied zero-shot to in-the-wild internet videos.
- Score: 40.557276644446475
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate capture of human-object interaction from ubiquitous sensors like RGB cameras is important for applications in human understanding, gaming, and robot learning. However, inferring 4D interactions from a single RGB view is highly challenging due to the unknown object and human information, depth ambiguity, occlusion, and complex motion, which hinder consistent 3D and temporal reconstruction. Previous methods simplify the setup by assuming a ground-truth object template or constraining to a limited set of object categories. We present CARI4D, the first category-agnostic method that reconstructs spatially and temporally consistent 4D human-object interaction at metric scale from monocular RGB videos. To this end, we propose a pose hypothesis selection algorithm that robustly integrates the individual predictions from foundation models, jointly refines them through a learned render-and-compare paradigm to ensure spatial, temporal, and pixel alignment, and finally reasons about intricate contacts for further refinement that satisfies physical constraints. Experiments show that our method outperforms prior art by 38% on an in-distribution dataset and 36% on an unseen dataset in terms of reconstruction error. Our model generalizes beyond the training categories and thus can be applied zero-shot to in-the-wild internet videos. Our code and pretrained models will be publicly released.
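The abstract does not spell out the pose hypothesis selection step, but its core idea of choosing, among candidate poses from different foundation models, the one that best agrees with the observation, can be illustrated with a minimal sketch. The candidate format and scoring function below are hypothetical placeholders, not CARI4D's actual interface:

```python
import numpy as np

def select_pose_hypothesis(hypotheses, score_fn):
    """Return the hypothesis with the lowest consistency error, plus that error."""
    errors = [score_fn(h) for h in hypotheses]
    best = int(np.argmin(errors))
    return hypotheses[best], errors[best]

# Toy usage: candidate object translations, scored by distance to an observation.
# A real system would score by reprojection / render-and-compare error instead.
observation = np.array([1.0, 2.0, 3.0])
candidates = [
    np.array([0.0, 0.0, 0.0]),
    np.array([1.1, 2.0, 2.9]),
    np.array([5.0, 5.0, 5.0]),
]
best, err = select_pose_hypothesis(
    candidates, lambda h: np.linalg.norm(h - observation)
)
```

The selected hypothesis would then seed the learned render-and-compare refinement described in the abstract.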
Related papers
- ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors [51.06020148149403]
We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded.
arXiv Detail & Related papers (2026-03-04T17:58:04Z) - Efficient and Scalable Monocular Human-Object Interaction Motion Reconstruction [19.16200327159635]
Generalized robots must learn from diverse, large-scale human-object interactions (HOI) to robustly operate in the real world. We introduce 4DHOISolver, a novel and efficient optimization framework that constrains the ill-posed 4D HOI reconstruction problem. We introduce Open4DHOI, a new large-scale 4D HOI dataset featuring a diverse catalog of 144 object types and 103 actions.
arXiv Detail & Related papers (2025-11-30T16:21:47Z) - HOSt3R: Keypoint-free Hand-Object 3D Reconstruction from RGB images [27.025336665386735]
We propose a robust, keypoint detector-free approach to estimating hand-object 3D transformations from monocular motion video/images. We further integrate this with a multi-view reconstruction pipeline to accurately recover hand-object 3D shape. Our method, named HOSt3R, is unconstrained, does not rely on pre-scanned object templates or camera intrinsics, and reaches state-of-the-art performance.
arXiv Detail & Related papers (2025-08-22T15:30:40Z) - TesserAct: Learning 4D Embodied World Models [66.8519958275311]
We learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into their predictions, but also allows us to effectively learn accurate inverse dynamic models for an embodied agent.
arXiv Detail & Related papers (2025-04-29T17:59:30Z) - Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors [31.277540988829976]
This paper proposes a novel zero-shot HOI synthesis framework without relying on end-to-end training on currently limited 3D HOI datasets. We employ pre-trained human pose estimation models to extract human poses and introduce a generalizable category-level 6-DoF estimation method to obtain the object poses from 2D HOI images.
arXiv Detail & Related papers (2025-03-25T23:55:47Z) - CORE4D: A 4D Human-Object-Human Interaction Dataset for Collaborative Object REarrangement [24.287902864042792]
We present CORE4D, a novel large-scale 4D human-object collaboration dataset. With 1K human-object-human motion sequences captured in the real world, we enrich CORE4D by contributing an iterative collaboration strategy to augment motions to a variety of novel objects. Benefiting from extensive motion patterns provided by CORE4D, we benchmark two tasks aiming at generating human-object interaction: human-object motion forecasting and interaction synthesis.
arXiv Detail & Related papers (2024-06-27T17:32:18Z) - HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video [70.11702620562889]
HOLD is the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video.
We develop a compositional articulated implicit model that can disentangle the 3D hand and object from 2D images.
Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings.
arXiv Detail & Related papers (2023-11-30T10:50:35Z) - Decaf: Monocular Deformation Capture for Face and Hand Interactions [77.75726740605748]
This paper introduces the first method that allows tracking human hands interacting with human faces in 3D from single monocular RGB videos.
We model hands as articulated objects inducing non-rigid face deformations during an active interaction.
Our method relies on a new hand-face motion and interaction capture dataset with realistic face deformations acquired with a markerless multi-view camera system.
arXiv Detail & Related papers (2023-09-28T17:59:51Z) - Articulated 3D Human-Object Interactions from RGB Videos: An Empirical Analysis of Approaches and Challenges [19.21834600205309]
We canonicalize the task of articulated 3D human-object interaction reconstruction from RGB video.
We use five families of methods for this task: 3D plane estimation, 3D cuboid estimation, CAD model fitting, implicit field fitting, and free-form mesh fitting.
Our experiments show that all methods struggle to obtain high accuracy results even when provided ground truth information.
arXiv Detail & Related papers (2022-09-12T21:03:25Z) - Learning Dynamic View Synthesis With Few RGBD Cameras [60.36357774688289]
We propose to utilize RGBD cameras to synthesize free-viewpoint videos of dynamic indoor scenes.
We generate point clouds from RGBD frames and then render them into free-viewpoint videos via a neural renderer.
We introduce a simple Regional Depth-Inpainting module that adaptively inpaints missing depth values to render complete novel views.
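The idea behind depth inpainting, filling missing depth values from nearby valid measurements so complete novel views can be rendered, can be illustrated with a minimal single-pass sketch. The 3x3 mean filter below is an assumption for illustration only; the paper's Regional Depth-Inpainting module is learned and adaptive:

```python
import numpy as np

def inpaint_depth(depth, invalid=0.0):
    """Fill invalid depth pixels with the mean of valid 3x3 neighbours (one pass)."""
    out = depth.copy()
    h, w = depth.shape
    for y in range(h):
        for x in range(w):
            if depth[y, x] == invalid:
                patch = depth[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
                valid = patch[patch != invalid]
                if valid.size:  # leave the pixel untouched if no valid neighbour
                    out[y, x] = valid.mean()
    return out

# Toy usage: a hole in the middle of a flat depth map gets filled.
depth = np.array([[1.0, 1.0, 1.0],
                  [1.0, 0.0, 1.0],
                  [1.0, 1.0, 1.0]])
filled = inpaint_depth(depth)
```

A learned module can additionally decide where and how aggressively to inpaint, which a fixed mean filter cannot.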
arXiv Detail & Related papers (2022-04-22T03:17:35Z) - CHORE: Contact, Human and Object REconstruction from a single RGB image [40.817960406002506]
CHORE is a novel method that learns to jointly reconstruct the human and the object from a single RGB image.
We compute a neural reconstruction of human and object represented implicitly with two unsigned distance fields.
Experiments show that our joint reconstruction learned with the proposed strategy significantly outperforms the SOTA.
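The implicit representation mentioned above, one unsigned distance field per entity, can be illustrated on toy point sets: each field gives the distance to the nearest surface point, and a scene-level field is their pointwise minimum. This is a sketch of the representation only, not CHORE's learned networks:

```python
import numpy as np

def udf(query, surface_points):
    """Unsigned distance from each query point to its nearest surface point."""
    d = np.linalg.norm(query[:, None, :] - surface_points[None, :, :], axis=-1)
    return d.min(axis=1)

# Toy "surfaces": one human point at the origin, one object point at (1, 0, 0).
human = np.array([[0.0, 0.0, 0.0]])
obj = np.array([[1.0, 0.0, 0.0]])
q = np.array([[0.25, 0.0, 0.0]])

# Composite scene distance: nearest surface of either entity.
joint = np.minimum(udf(q, human), udf(q, obj))
```

In a learned setting, `udf` would be replaced by a neural network conditioned on image features, but the min-composition of the two fields works the same way.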
arXiv Detail & Related papers (2022-04-05T18:38:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.