Generalizable Articulated Object Reconstruction from Casually Captured RGBD Videos
- URL: http://arxiv.org/abs/2506.08334v1
- Date: Tue, 10 Jun 2025 01:41:46 GMT
- Title: Generalizable Articulated Object Reconstruction from Casually Captured RGBD Videos
- Authors: Weikun Peng, Jun Lv, Cewu Lu, Manolis Savva
- Abstract summary: We focus on reconstruction of an articulated object from a casually captured RGBD video shot with a hand-held camera. A casually captured video of an interaction with an articulated object is easy to acquire at scale using smartphones. We introduce a coarse-to-fine framework that infers joint parameters and segments movable parts of the object from a dynamic RGBD video.
- Score: 53.47352228180637
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Articulated objects are prevalent in daily life. Understanding their kinematic structure and reconstructing them have numerous applications in embodied AI and robotics. However, current methods require carefully captured data for training or inference, preventing practical, scalable, and generalizable reconstruction of articulated objects. We focus on reconstruction of an articulated object from a casually captured RGBD video shot with a hand-held camera. A casually captured video of an interaction with an articulated object is easy to acquire at scale using smartphones. However, this setting is quite challenging, as the object and camera move simultaneously and there are significant occlusions as the person interacts with the object. To tackle these challenges, we introduce a coarse-to-fine framework that infers joint parameters and segments movable parts of the object from a dynamic RGBD video. To evaluate our method under this new setting, we build a 20$\times$ larger synthetic dataset of 784 videos containing 284 objects across 11 categories. We compare our approach with existing methods that also take video as input. Experiments show that our method can reconstruct synthetic and real articulated objects across different categories from dynamic RGBD videos, outperforming existing methods significantly.
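The abstract describes inferring joint parameters from observed motion. As a minimal illustration of what "joint parameters" means for a revolute joint (an assumption-laden sketch, not the paper's actual method), the rotation axis, angle, and a point on the axis can be recovered in closed form from the rigid motion (R, t) that a movable part undergoes between two frames:

```python
import numpy as np

def revolute_axis_from_motion(R, t):
    """Recover axis direction, rotation angle, and a point on the axis
    of a revolute joint from the rigid motion (R, t) of a movable part
    between two frames. Illustrative sketch only."""
    # Rotation angle from the trace of R.
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    # Axis direction from the skew-symmetric part of R.
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]])
    axis /= np.linalg.norm(axis)
    # A point p on the axis satisfies (I - R) p = t; the system is
    # rank-deficient along the axis, so solve it in least squares.
    p = np.linalg.lstsq(np.eye(3) - R, t, rcond=None)[0]
    return axis, angle, p
```

A prismatic joint reduces analogously to a translation direction (R close to identity, t along the joint axis); in practice the part motion itself must first be estimated from the segmented, tracked point clouds, which is where the difficulty of the casual-capture setting lies.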
Related papers
- Articulated 3D Scene Graphs for Open-World Mobile Manipulation [55.97942733699124]
We present MoMa-SG, a framework for building semantic-kinematic 3D scene graphs of articulated scenes. We estimate articulation models using a novel unified twist estimation formulation. We also introduce the novel Arti4D-Semantic dataset.
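The "unified twist estimation" above refers to parameterizing both revolute and prismatic joints with a single twist (omega, v). The unification itself is standard screw theory (this sketch is the textbook SE(3) exponential map, not MoMa-SG's estimator): a unit omega yields a rotation about a line, while omega = 0 with unit v yields a pure translation.

```python
import numpy as np

def twist_exp(omega, v, theta):
    """SE(3) exponential of the twist (omega, v) scaled by theta,
    returned as a 4x4 rigid transform. Unit omega: revolute joint;
    omega = 0 with unit v: prismatic joint. Textbook sketch."""
    T = np.eye(4)
    if np.linalg.norm(omega) < 1e-12:
        T[:3, 3] = v * theta  # pure translation (prismatic joint)
        return T
    # Rodrigues' formula for the rotation part.
    K = np.array([[0.0, -omega[2], omega[1]],
                  [omega[2], 0.0, -omega[0]],
                  [-omega[1], omega[0], 0.0]])
    T[:3, :3] = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    # Translation part of the SE(3) exponential:
    # G(theta) = I*theta + (1-cos)K + (theta-sin)K^2.
    T[:3, 3] = (np.eye(3) * theta + (1 - np.cos(theta)) * K
                + (theta - np.sin(theta)) * (K @ K)) @ v
    return T
```

For a revolute joint whose axis omega passes through a point q, the corresponding twist has v = -cross(omega, q), and points on the axis are fixed by the motion.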
arXiv Detail & Related papers (2026-02-18T10:40:35Z) - sim2art: Accurate Articulated Object Modeling from a Single Video using Synthetic Training Data Only [20.99905717289565]
We present the first data-driven approach that jointly predicts part segmentation and joint parameters from monocular video captured with a freely moving camera. Our method demonstrates strong generalization to real-world objects, offering a scalable and practical solution for articulated object understanding. Our approach operates directly on casually recorded video, making it suitable for real-time applications in dynamic environments.
arXiv Detail & Related papers (2025-12-08T16:38:30Z) - VideoArtGS: Building Digital Twins of Articulated Objects from Monocular Video [60.63575135514847]
Building digital twins of articulated objects from monocular video presents an essential challenge in computer vision. We introduce VideoArtGS, a novel approach that reconstructs high-fidelity digital twins of articulated objects from monocular video. VideoArtGS demonstrates state-of-the-art performance in articulation and mesh reconstruction, reducing the reconstruction error by about two orders of magnitude compared to existing methods.
arXiv Detail & Related papers (2025-09-22T11:52:02Z) - PickScan: Object discovery and reconstruction from handheld interactions [99.99566882133179]
We develop an interaction-guided and class-agnostic method to reconstruct 3D representations of scenes.
Our main contribution is a novel approach to detecting user-object interactions and extracting the masks of manipulated objects.
Compared to Co-Fusion, the only comparable interaction-based and class-agnostic baseline, this corresponds to a reduction in chamfer distance of 73%.
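PickScan's 73% improvement is reported in chamfer distance, the standard reconstruction metric in this list. For reference, the symmetric chamfer distance between two point clouds can be sketched as follows (brute-force O(N*M); real evaluations typically use a KD-tree):

```python
import numpy as np

def chamfer_distance(A, B):
    """Symmetric chamfer distance between point sets A (N, 3) and
    B (M, 3): the mean nearest-neighbor distance from A to B plus
    the mean nearest-neighbor distance from B to A."""
    # Pairwise distance matrix, shape (N, M).
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

Conventions vary (sum vs. average of the two directions, squared vs. unsquared distances), so reported percentage reductions are only comparable under a fixed convention.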
arXiv Detail & Related papers (2024-11-17T23:09:08Z) - HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video [70.11702620562889]
HOLD is the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video.
We develop a compositional articulated implicit model that can disentangle the 3D hand and object from 2D images.
Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings.
arXiv Detail & Related papers (2023-11-30T10:50:35Z) - Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis [76.72505510632904]
We present Total-Recon, the first method to reconstruct deformable scenes from long monocular RGBD videos.
Our method hierarchically decomposes the scene into the background and objects, whose motion is decomposed into root-body motion and local articulations.
arXiv Detail & Related papers (2023-04-24T17:59:52Z) - InterCap: Joint Markerless 3D Tracking of Humans and Objects in Interaction [0.0]
InterCap reconstructs whole-bodies and objects from multi-view RGB-D data.
Azure Kinect sensors allow us to set up a simple multi-view RGB-D capture system.
InterCap has 223 RGB-D videos, resulting in 67,357 multi-view frames, each containing 6 RGB-D images.
arXiv Detail & Related papers (2022-09-26T00:46:49Z) - Articulated 3D Human-Object Interactions from RGB Videos: An Empirical Analysis of Approaches and Challenges [19.21834600205309]
We canonicalize the task of articulated 3D human-object interaction reconstruction from RGB video.
We use five families of methods for this task: 3D plane estimation, 3D cuboid estimation, CAD model fitting, implicit field fitting, and free-form mesh fitting.
Our experiments show that all methods struggle to obtain high-accuracy results even when provided with ground truth information.
arXiv Detail & Related papers (2022-09-12T21:03:25Z) - Class-agnostic Reconstruction of Dynamic Objects from Videos [127.41336060616214]
We introduce REDO, a class-agnostic framework to REconstruct the Dynamic Objects from RGBD or calibrated videos.
We develop two novel modules. First, we introduce a canonical 4D implicit function which is pixel-aligned with aggregated temporal visual cues.
Second, we develop a 4D transformation module which captures object dynamics to support temporal propagation and aggregation.
arXiv Detail & Related papers (2021-12-03T18:57:47Z) - Estimating 3D Motion and Forces of Human-Object Interactions from Internet Videos [49.52070710518688]
We introduce a method to reconstruct the 3D motion of a person interacting with an object from a single RGB video.
Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces on the human body.
arXiv Detail & Related papers (2021-11-02T13:40:18Z) - D3D-HOI: Dynamic 3D Human-Object Interactions from Videos [49.38319295373466]
We introduce D3D-HOI: a dataset of monocular videos with ground truth annotations of 3D object pose, shape and part motion during human-object interactions.
Our dataset consists of several common articulated objects captured from diverse real-world scenes and camera viewpoints.
We leverage the estimated 3D human pose for more accurate inference of the object spatial layout and dynamics.
arXiv Detail & Related papers (2021-08-19T00:49:01Z) - MOLTR: Multiple Object Localisation, Tracking, and Reconstruction from Monocular RGB Videos [30.541606989348377]
MOLTR is a solution to object-centric mapping using only monocular image sequences and camera poses.
It is able to localise, track, and reconstruct multiple objects in an online fashion as an RGB camera captures a video of its surroundings.
We evaluate localisation, tracking, and reconstruction on benchmarking datasets for indoor and outdoor scenes.
arXiv Detail & Related papers (2020-12-09T23:15:08Z) - MoreFusion: Multi-object Reasoning for 6D Pose Estimation from Volumetric Fusion [19.034317851914725]
We present a system which can estimate the accurate poses of multiple known objects in contact and occlusion from real-time, embodied multi-view vision.
Our approach makes 3D object pose proposals from single RGB-D views, accumulates pose estimates and non-parametric occupancy information from multiple views as the camera moves.
We verify the accuracy and robustness of our approach experimentally on two object datasets: YCB-Video and our own challenging Cluttered YCB-Video.
arXiv Detail & Related papers (2020-04-09T02:29:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.