Articulated Object Estimation in the Wild
- URL: http://arxiv.org/abs/2509.01708v1
- Date: Mon, 01 Sep 2025 18:34:17 GMT
- Title: Articulated Object Estimation in the Wild
- Authors: Abdelrhman Werby, Martin Büchner, Adrian Röfer, Chenguang Huang, Wolfram Burgard, Abhinav Valada
- Abstract summary: ArtiPoint is a novel estimation framework that can infer articulated object models under dynamic camera motion and partial observability.
By combining deep point tracking with a factor graph optimization framework, ArtiPoint robustly estimates articulated part trajectories and articulation axes directly from raw RGB-D videos.
We benchmark ArtiPoint against a range of classical and learning-based baselines, demonstrating its superior performance on Arti4D.
- Score: 25.616481887384708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the 3D motion of articulated objects is essential in robotic scene understanding, mobile manipulation, and motion planning. Prior methods for articulation estimation have primarily focused on controlled settings, assuming either fixed camera viewpoints or direct observations of various object states, which tend to fail in more realistic unconstrained environments. In contrast, humans effortlessly infer articulation by watching others manipulate objects. Inspired by this, we introduce ArtiPoint, a novel estimation framework that can infer articulated object models under dynamic camera motion and partial observability. By combining deep point tracking with a factor graph optimization framework, ArtiPoint robustly estimates articulated part trajectories and articulation axes directly from raw RGB-D videos. To foster future research in this domain, we introduce Arti4D, the first ego-centric in-the-wild dataset that captures articulated object interactions at a scene level, accompanied by articulation labels and ground-truth camera poses. We benchmark ArtiPoint against a range of classical and learning-based baselines, demonstrating its superior performance on Arti4D. We make code and Arti4D publicly available at https://artipoint.cs.uni-freiburg.de.
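The core geometric step the abstract describes can be illustrated in isolation: once 3D points on a moving part have been tracked across frames, the axis of a revolute joint is determined by the part's rigid motion. The sketch below is a minimal numpy example, not ArtiPoint's factor-graph implementation, and the function names are our own. It assumes noise-free 3D tracks on a single rigid part, a static camera, and a revolute joint; it fits the frame-to-frame rigid motion with the Kabsch algorithm and then recovers the hinge direction and a point on the hinge from that motion.

```python
import numpy as np

def fit_rigid_transform(src, dst):
    """Kabsch algorithm: least-squares R, t such that dst_i ~= R @ src_i + t."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, c_dst - R @ c_src

def revolute_axis(R, t):
    """Hinge direction, a point on the hinge, and the rotation angle.

    Valid for rotation angles strictly between 0 and pi."""
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    # Axis direction from the skew-symmetric part of R.
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    axis = w / np.linalg.norm(w)
    # Any point p on the axis satisfies (I - R) @ p = t. The system is rank-2,
    # so least squares returns the axis point closest to the origin.
    point = np.linalg.lstsq(np.eye(3) - R, t, rcond=None)[0]
    return axis, point, angle

# Toy check: a "cabinet door" rotating 30 degrees about a vertical hinge at x = 0.5.
theta = np.deg2rad(30.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
hinge = np.array([0.5, 0.0, 0.0])
pts_t0 = np.random.default_rng(0).uniform(-0.3, 0.3, size=(50, 3))
pts_t1 = (pts_t0 - hinge) @ Rz.T + hinge

R, t = fit_rigid_transform(pts_t0, pts_t1)
axis, point, angle = revolute_axis(R, t)
print("axis:", np.round(axis, 3))            # ~ [0, 0, 1]
print("point on axis:", np.round(point, 3))  # ~ [0.5, 0, 0]
print("angle (deg):", round(float(np.rad2deg(angle)), 2))  # ~ 30.0
```

In the paper's setting, the same constraint is embedded in a factor graph so that noisy point tracks over many frames, camera motion, and partial observability are handled jointly rather than from a single frame pair.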
Related papers
- Articulated 3D Scene Graphs for Open-World Mobile Manipulation [55.97942733699124]
We present MoMa-SG, a framework for building semantic-kinematic 3D scene graphs of articulated scenes.
We estimate articulation models using a novel unified twist estimation formulation.
We also introduce the novel Arti4D-Semantic dataset.
arXiv Detail & Related papers (2026-02-18T10:40:35Z)
- ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos [48.24897274501108]
We introduce a 3D object-centric dynamics model that predicts future 6-DoF poses and trajectories of rigid objects from short egocentric video sequences.
Unlike conventional world or dynamics models that operate in pixel or latent space, ObjectForesight represents the world explicitly in 3D at the object level.
We show that ObjectForesight achieves significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes.
arXiv Detail & Related papers (2026-01-08T18:58:08Z)
- CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction [40.557276644446475]
We present CARI4D, the first category-agnostic method that reconstructs spatially and temporally consistent 4D human-object interaction at metric scale from monocular RGB videos.
Our model generalizes beyond the training categories and thus can be applied zero-shot to in-the-wild internet videos.
arXiv Detail & Related papers (2025-12-12T19:11:11Z)
- 3D Foundation Models Enable Simultaneous Geometry and Pose Estimation of Grasped Objects [13.58353565350936]
We contribute methodology to jointly estimate the geometry and pose of objects grasped by a robot.
Our method transforms the estimated geometry into the robot's coordinate frame.
We empirically evaluate our approach on a robot manipulator holding a diverse set of real-world objects.
arXiv Detail & Related papers (2024-07-14T21:02:55Z)
- DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos [76.01906393673897]
We propose a self-supervised method to jointly learn 3D motion and depth from monocular videos.
Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion.
Our model delivers superior performance in all evaluated settings.
arXiv Detail & Related papers (2024-03-09T12:22:46Z)
- HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video [70.11702620562889]
HOLD is the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video.
We develop a compositional articulated implicit model that can disentangle the 3D hand and object from 2D images.
Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings.
arXiv Detail & Related papers (2023-11-30T10:50:35Z)
- ROAM: Robust and Object-Aware Motion Generation Using Neural Pose Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object.
We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z)
- Visibility Aware Human-Object Interaction Tracking from Single RGB Camera [40.817960406002506]
We propose a novel method to track the 3D human, object, contacts between them, and their relative translation across frames from a single RGB camera.
We condition our neural field reconstructions for human and object on per-frame SMPL model estimates obtained by pre-fitting SMPL to a video sequence.
Human and object motion from visible frames provides valuable information to infer the occluded object.
arXiv Detail & Related papers (2023-03-29T06:23:44Z)
- ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation [68.80339307258835]
ARCTIC is a dataset of two hands that dexterously manipulate objects.
It contains 2.1M video frames paired with accurate 3D hand meshes and detailed, dynamic contact information.
arXiv Detail & Related papers (2022-04-28T17:23:59Z)
- D3D-HOI: Dynamic 3D Human-Object Interactions from Videos [49.38319295373466]
We introduce D3D-HOI: a dataset of monocular videos with ground truth annotations of 3D object pose, shape and part motion during human-object interactions.
Our dataset consists of several common articulated objects captured from diverse real-world scenes and camera viewpoints.
We leverage the estimated 3D human pose for more accurate inference of the object spatial layout and dynamics.
arXiv Detail & Related papers (2021-08-19T00:49:01Z)
- "What's This?" -- Learning to Segment Unknown Objects from Manipulation Sequences [27.915309216800125]
We present a novel framework for self-supervised grasped object segmentation with a robotic manipulator.
We propose a single, end-to-end trainable architecture which jointly incorporates motion cues and semantic knowledge.
Our method depends neither on visual registration of a kinematic robot model or 3D object models, nor on precise hand-eye calibration or additional sensor data.
arXiv Detail & Related papers (2020-11-06T10:55:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.