Generalizable Articulated Object Reconstruction from Casually Captured RGBD Videos
- URL: http://arxiv.org/abs/2506.08334v1
- Date: Tue, 10 Jun 2025 01:41:46 GMT
- Title: Generalizable Articulated Object Reconstruction from Casually Captured RGBD Videos
- Authors: Weikun Peng, Jun Lv, Cewu Lu, Manolis Savva
- Abstract summary: We focus on reconstruction of an articulated object from a casually captured RGBD video shot with a hand-held camera. A casually captured video of an interaction with an articulated object is easy to acquire at scale using smartphones. We introduce a coarse-to-fine framework that infers joint parameters and segments movable parts of the object from a dynamic RGBD video.
- Score: 53.47352228180637
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Articulated objects are prevalent in daily life. Understanding their kinematic structure and reconstructing them have numerous applications in embodied AI and robotics. However, current methods require carefully captured data for training or inference, preventing practical, scalable, and generalizable reconstruction of articulated objects. We focus on reconstruction of an articulated object from a casually captured RGBD video shot with a hand-held camera. A casually captured video of an interaction with an articulated object is easy to acquire at scale using smartphones. However, this setting is quite challenging, as the object and camera move simultaneously and there are significant occlusions as the person interacts with the object. To tackle these challenges, we introduce a coarse-to-fine framework that infers joint parameters and segments movable parts of the object from a dynamic RGBD video. To evaluate our method under this new setting, we build a 20$\times$ larger synthetic dataset of 784 videos containing 284 objects across 11 categories. We compare our approach with existing methods that also take video as input. Experiments show that our method can reconstruct synthetic and real articulated objects across different categories from dynamic RGBD videos, outperforming existing methods significantly.
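The abstract does not detail how the framework recovers joint parameters, but the quantity it estimates can be illustrated with a standard, generic recipe: fit the rigid motion of a segmented movable part between two frames and read a revolute joint off that motion. The NumPy sketch below follows that recipe under simplifying assumptions (known point correspondences, camera motion already compensated); it is an illustration, not the authors' implementation.

```python
# A minimal, generic sketch (not the paper's implementation): given a movable
# part segmented in two frames with known point correspondences and camera
# motion already removed, fit the part's rigid motion and read off revolute
# joint parameters (axis direction, a point on the axis, opening angle).
import numpy as np


def fit_rigid_transform(src, dst):
    """Kabsch fit of R, t such that dst ≈ src @ R.T + t (both Nx3)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def revolute_joint_from_motion(R, t):
    """Axis direction, a point on the axis, and angle of the rigid motion (R, t)."""
    # Axis direction: eigenvector of R for the eigenvalue closest to 1.
    w, V = np.linalg.eig(R)
    axis = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    axis /= np.linalg.norm(axis)
    # A point p on the axis satisfies (I - R) p ≈ t; the system is rank-2, so take
    # the least-squares solution and drop the free component along the axis.
    p, *_ = np.linalg.lstsq(np.eye(3) - R, t, rcond=None)
    p -= axis * (axis @ p)
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    return axis, p, angle  # note: the sign of `axis` (and hence the angle direction) is ambiguous


if __name__ == "__main__":
    # Toy example: a door-like part rotated 30 degrees about the z-axis through (0.5, 0, 0).
    rng = np.random.default_rng(0)
    part_t0 = rng.uniform(-0.2, 0.2, size=(200, 3)) + np.array([1.0, 0.0, 0.0])
    theta = np.deg2rad(30.0)
    R_gt = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                     [np.sin(theta),  np.cos(theta), 0.0],
                     [0.0,            0.0,           1.0]])
    pivot = np.array([0.5, 0.0, 0.0])
    part_t1 = (part_t0 - pivot) @ R_gt.T + pivot
    R, t = fit_rigid_transform(part_t0, part_t1)
    axis, point, angle = revolute_joint_from_motion(R, t)
    print(axis, point, np.rad2deg(angle))  # ≈ ±[0, 0, 1], [0.5, 0, 0], 30
```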
Related papers
- PickScan: Object discovery and reconstruction from handheld interactions [99.99566882133179]
We develop an interaction-guided and class-agnostic method to reconstruct 3D representations of scenes.
Our main contribution is a novel approach to detecting user-object interactions and extracting the masks of manipulated objects.
Compared to Co-Fusion, the only comparable interaction-based and class-agnostic baseline, this corresponds to a 73% reduction in Chamfer distance (see the sketch after this entry).
arXiv Detail & Related papers (2024-11-17T23:09:08Z)
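For reference, the Chamfer distance cited in the PickScan summary above is the symmetric nearest-neighbour distance between reconstructed and ground-truth point clouds. A minimal NumPy sketch of the metric (a generic definition, not PickScan's evaluation code; some papers use squared distances instead):

```python
# Symmetric Chamfer distance between two point clouds (generic sketch;
# brute-force O(N*M), fine for small clouds).
import numpy as np


def chamfer_distance(a, b):
    """Mean nearest-neighbour distance from a to b plus from b to a (Nx3, Mx3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(500, 3))
    noisy = gt + rng.normal(scale=0.01, size=gt.shape)
    print(chamfer_distance(gt, noisy))  # small value; identical clouds give 0.0
```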
- HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video [70.11702620562889]
HOLD is the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video.
We develop a compositional articulated implicit model that disentangles the 3D hand and object from 2D images.
Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings.
arXiv Detail & Related papers (2023-11-30T10:50:35Z)
- Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis [76.72505510632904]
We present Total-Recon, the first method to reconstruct deformable scenes from long monocular RGBD videos.
Our method hierarchically decomposes the scene into the background and objects, whose motion is decomposed into root-body motion and local articulations (see the transform-composition sketch after this entry).
arXiv Detail & Related papers (2023-04-24T17:59:52Z)
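Total-Recon's hierarchical decomposition above can be pictured as a two-level transform chain: a point in a part's canonical frame is moved first by that part's local articulation and then by the object's root-body motion. The following is a minimal SE(3) bookkeeping sketch of that composition, not the paper's code:

```python
# Generic sketch of hierarchical motion composition (not Total-Recon's code):
# world point = root-body pose ∘ local articulation applied to a canonical point.
import numpy as np


def se3(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T


def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Canonical-frame point on a movable part (e.g. a limb or a door panel).
x_canonical = np.array([0.3, 0.0, 0.0, 1.0])

# Local articulation: rotate the part 45 degrees about its own z-axis.
T_part = se3(rot_z(np.deg2rad(45.0)), np.zeros(3))

# Root-body motion: the whole object translated and turned in the world.
T_root = se3(rot_z(np.deg2rad(10.0)), np.array([1.0, 0.5, 0.0]))

# Compose: articulate locally first, then place the object in the world.
x_world = T_root @ T_part @ x_canonical
print(x_world[:3])
```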
- Articulated 3D Human-Object Interactions from RGB Videos: An Empirical Analysis of Approaches and Challenges [19.21834600205309]
We canonicalize the task of articulated 3D human-object interaction reconstruction from RGB video.
We use five families of methods for this task: 3D plane estimation, 3D cuboid estimation, CAD model fitting, implicit field fitting, and free-form mesh fitting.
Our experiments show that all methods struggle to obtain high accuracy results even when provided with ground truth information.
arXiv Detail & Related papers (2022-09-12T21:03:25Z)
- Class-agnostic Reconstruction of Dynamic Objects from Videos [127.41336060616214]
We introduce REDO, a class-agnostic framework to REconstruct Dynamic Objects from RGBD or calibrated videos.
We develop two novel modules. First, we introduce a canonical 4D implicit function which is pixel-aligned with aggregated temporal visual cues (see the pixel-aligned sampling sketch after this list).
Second, we develop a 4D transformation module which captures object dynamics to support temporal propagation and aggregation.
arXiv Detail & Related papers (2021-12-03T18:57:47Z)
- D3D-HOI: Dynamic 3D Human-Object Interactions from Videos [49.38319295373466]
We introduce D3D-HOI: a dataset of monocular videos with ground truth annotations of 3D object pose, shape and part motion during human-object interactions.
Our dataset consists of several common articulated objects captured from diverse real-world scenes and camera viewpoints.
We leverage the estimated 3D human pose for more accurate inference of the object spatial layout and dynamics.
arXiv Detail & Related papers (2021-08-19T00:49:01Z)
- MOLTR: Multiple Object Localisation, Tracking, and Reconstruction from Monocular RGB Videos [30.541606989348377]
MOLTR is a solution to object-centric mapping using only monocular image sequences and camera poses.
It is able to localise, track, and reconstruct multiple objects in an online fashion when an RGB camera captures a video of the surroundings.
We evaluate localisation, tracking, and reconstruction on benchmarking datasets for indoor and outdoor scenes.
arXiv Detail & Related papers (2020-12-09T23:15:08Z)
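Finally, the "pixel-aligned" implicit function mentioned in the REDO summary above refers to conditioning an implicit decoder on image features sampled at a query point's 2D projection (in the spirit of PIFu-style models). The sketch below shows only that sampling-and-decoding pattern with placeholder features and weights; it is not REDO's architecture and omits temporal aggregation and the 4D transformation module.

```python
# Generic sketch of a pixel-aligned implicit query (not REDO's architecture):
# project a 3D point into the image, bilinearly sample a feature map at that
# pixel, and decode the feature (plus the point's depth) into an occupancy value.
import numpy as np

rng = np.random.default_rng(0)

H, W, C = 60, 80, 16
feature_map = rng.normal(size=(H, W, C))      # placeholder per-pixel image features
K = np.array([[100.0, 0.0, W / 2],            # placeholder pinhole intrinsics
              [0.0, 100.0, H / 2],
              [0.0,   0.0,   1.0]])
W1 = rng.normal(size=(C + 1, 32)) * 0.1       # placeholder MLP weights
W2 = rng.normal(size=(32, 1)) * 0.1


def bilinear_sample(fmap, u, v):
    """Bilinearly interpolate a (H, W, C) feature map at continuous pixel (u, v)."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, fmap.shape[1] - 1), min(v0 + 1, fmap.shape[0] - 1)
    au, av = u - u0, v - v0
    return ((1 - au) * (1 - av) * fmap[v0, u0] + au * (1 - av) * fmap[v0, u1]
            + (1 - au) * av * fmap[v1, u0] + au * av * fmap[v1, u1])


def query_occupancy(x_cam):
    """Occupancy in [0, 1] for a 3D point given in the camera frame."""
    u, v = (K @ x_cam)[:2] / x_cam[2]                       # perspective projection
    feat = bilinear_sample(feature_map, u, v)
    h = np.tanh(np.concatenate([feat, [x_cam[2]]]) @ W1)    # tiny placeholder MLP
    return 1.0 / (1.0 + np.exp(-(h @ W2)[0]))


print(query_occupancy(np.array([0.1, -0.05, 1.5])))
```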