sim2art: Accurate Articulated Object Modeling from a Single Video using Synthetic Training Data Only
- URL: http://arxiv.org/abs/2512.07698v1
- Date: Mon, 08 Dec 2025 16:38:30 GMT
- Title: sim2art: Accurate Articulated Object Modeling from a Single Video using Synthetic Training Data Only
- Authors: Arslan Artykov, Corentin Sautier, Vincent Lepetit
- Abstract summary: We present the first data-driven approach that jointly predicts part segmentation and joint parameters from monocular video captured with a freely moving camera. Our method demonstrates strong generalization to real-world objects, offering a scalable and practical solution for articulated object understanding. Our approach operates directly on casually recorded video, making it suitable for real-time applications in dynamic environments.
- Score: 20.99905717289565
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding articulated objects is a fundamental challenge in robotics and digital twin creation. To effectively model such objects, it is essential to recover both part segmentation and the underlying joint parameters. Despite the importance of this task, previous work has largely focused on setups like multi-view systems, object scanning, or static cameras. In this paper, we present the first data-driven approach that jointly predicts part segmentation and joint parameters from monocular video captured with a freely moving camera. Trained solely on synthetic data, our method demonstrates strong generalization to real-world objects, offering a scalable and practical solution for articulated object understanding. Our approach operates directly on casually recorded video, making it suitable for real-time applications in dynamic environments. Project webpage: https://aartykov.github.io/sim2art/
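To make the task statement concrete, the sketch below spells out one plausible interface for a model that "jointly predicts part segmentation and joint parameters": per-pixel part labels for every frame, plus a type, axis, origin, and state for each movable part. All names, tensor shapes, and the revolute/prismatic parameterization are illustrative assumptions made for this summary, not the paper's actual architecture or API.

```python
# Minimal sketch of the articulated-object prediction task (assumed interface,
# not the paper's code): video frames in, part masks and joint parameters out.
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal

import numpy as np


@dataclass
class JointPrediction:
    joint_type: Literal["revolute", "prismatic"]  # hinge vs. sliding joint
    axis: np.ndarray    # (3,) unit vector: rotation axis or sliding direction
    origin: np.ndarray  # (3,) point the axis passes through (revolute joints)
    state: float        # joint value: angle in radians or offset in meters


def predict_articulation(frames: np.ndarray) -> tuple[np.ndarray, list[JointPrediction]]:
    """Map a monocular clip to part segmentation and joint estimates.

    frames: (T, H, W, 3) RGB video from a freely moving camera.
    Returns:
      masks:  (T, H, W) integer part labels (0 = static base, 1.. = movable parts)
      joints: one JointPrediction per movable part label.
    """
    T, H, W, _ = frames.shape
    masks = np.zeros((T, H, W), dtype=np.int32)  # dummy output in this sketch
    joints = [JointPrediction("revolute",
                              axis=np.array([0.0, 0.0, 1.0]),
                              origin=np.zeros(3),
                              state=0.0)]
    return masks, joints


# Usage: a casually recorded clip would replace this zero tensor.
frames = np.zeros((8, 240, 320, 3), dtype=np.uint8)
masks, joints = predict_articulation(frames)
```

This two-type parameterization covers the common cases: a door or laptop lid is a revolute joint (an axis plus a pivot point), while a drawer is a prismatic joint (a direction only).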
Related papers
- Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects [59.51185639557874]
We introduce Kinematify, an automated framework that synthesizes articulated objects directly from arbitrary RGB images or textual descriptions. Our method addresses two core challenges: (i) inferring kinematic topologies for high-DoF objects and (ii) estimating joint parameters from static geometry.
arXiv Detail & Related papers (2025-11-03T07:21:42Z)
- VideoArtGS: Building Digital Twins of Articulated Objects from Monocular Video [60.63575135514847]
Building digital twins of articulated objects from monocular video presents an essential challenge in computer vision. We introduce VideoArtGS, a novel approach that reconstructs high-fidelity digital twins of articulated objects from monocular video. VideoArtGS demonstrates state-of-the-art performance in articulation and mesh reconstruction, reducing the reconstruction error by about two orders of magnitude compared to existing methods.
arXiv Detail & Related papers (2025-09-22T11:52:02Z)
- iTACO: Interactable Digital Twins of Articulated Objects from Casually Captured RGBD Videos [52.398752421673144]
We focus on motion analysis and part-level segmentation of an articulated object from a casually captured RGBD video shot with a hand-held camera. A casually captured video of an interaction with an articulated object is easy to obtain at scale using smartphones. We introduce iTACO: a coarse-to-fine framework that infers joint parameters and segments movable parts of the object from a dynamic RGBD video.
arXiv Detail & Related papers (2025-06-10T01:41:46Z)
- ObjectMover: Generative Object Movement with Video Prior [69.75281888309017]
We present ObjectMover, a generative model that can perform object movement in challenging scenes. We show that with this approach, our model is able to adjust to complex real-world scenarios. We propose a multi-task learning strategy that enables training on real-world video data to improve model generalization.
arXiv Detail & Related papers (2025-03-11T04:42:59Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- RELATE: Physically Plausible Multi-Object Scene Synthesis Using Structured Latent Spaces [77.07767833443256]
We present RELATE, a model that learns to generate physically plausible scenes and videos of multiple interacting objects.
In contrast to state-of-the-art methods in object-centric generative modeling, RELATE also extends naturally to dynamic scenes and generates videos of high visual fidelity.
arXiv Detail & Related papers (2020-07-02T17:27:27Z)