Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects
- URL: http://arxiv.org/abs/2511.01294v2
- Date: Tue, 04 Nov 2025 07:22:41 GMT
- Title: Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects
- Authors: Jiawei Wang, Dingyou Wang, Jiaming Hu, Qixuan Zhang, Jingyi Yu, Lan Xu
- Abstract summary: We introduce Kinematify, an automated framework that synthesizes articulated objects directly from arbitrary RGB images or textual descriptions. Our method addresses two core challenges: (i) inferring kinematic topologies for high-DoF objects and (ii) estimating joint parameters from static geometry.
- Score: 59.51185639557874
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A deep understanding of kinematic structures and movable components is essential for enabling robots to manipulate objects and model their own articulated forms. Such understanding is captured through articulated objects, which are essential for tasks such as physical simulation, motion planning, and policy learning. However, creating these models, particularly for objects with high degrees of freedom (DoF), remains a significant challenge. Existing methods typically rely on motion sequences or strong assumptions from hand-curated datasets, which hinders scalability. In this paper, we introduce Kinematify, an automated framework that synthesizes articulated objects directly from arbitrary RGB images or textual descriptions. Our method addresses two core challenges: (i) inferring kinematic topologies for high-DoF objects and (ii) estimating joint parameters from static geometry. To achieve this, we combine MCTS search for structural inference with geometry-driven optimization for joint reasoning, producing physically consistent and functionally valid descriptions. We evaluate Kinematify on diverse inputs from both synthetic and real-world environments, demonstrating improvements in registration and kinematic topology accuracy over prior work.
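To make the second challenge concrete (estimating joint parameters from geometry), the sketch below shows a standard screw-axis recovery: given two observed poses of a movable part, fit the rigid transform between them and extract the revolute joint's axis direction and a point on the axis. This is a textbook illustration of joint-parameter estimation, not Kinematify's actual geometry-driven optimization (which works from static geometry); all function names here are illustrative.

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares rigid transform such that Q ~ R @ P + t (P, Q: Nx3)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)            # cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t

def revolute_joint_from_motion(R, t):
    """Recover a revolute joint's axis direction and a point on the axis
    from the rigid motion (R, t) of the moving part."""
    # Axis direction: eigenvector of R with eigenvalue 1.
    w, V = np.linalg.eig(R)
    axis = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    axis /= np.linalg.norm(axis)
    # Point on axis: min-norm least-squares solution of (I - R) p = t,
    # since a pure rotation about point c gives t = (I - R) c.
    p, *_ = np.linalg.lstsq(np.eye(3) - R, t, rcond=None)
    return axis, p

def rot_about_axis(a, theta):
    """Rodrigues' formula: rotation by theta about unit axis a."""
    a = a / np.linalg.norm(a)
    K = np.array([[0, -a[2], a[1]],
                  [a[2], 0, -a[0]],
                  [-a[1], a[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
```

For example, sampling points on a door panel in two configurations, `kabsch` recovers the panel's rigid motion and `revolute_joint_from_motion` returns the hinge line, which can then be written out as a URDF joint.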
Related papers
- Simulation-Ready Cluttered Scene Estimation via Physics-aware Joint Shape and Pose Optimization [27.083888910311984]
Estimating simulation-ready scenes from real-world observations is crucial for downstream planning and policy learning tasks. Existing methods struggle in cluttered environments. We propose a unified optimization-based formulation for real-to-sim scene estimation.
arXiv Detail & Related papers (2026-02-23T18:58:24Z) - PhyScensis: Physics-Augmented LLM Agents for Complex Physical Scene Arrangement [89.35154754765502]
PhyScensis is an agent-based framework powered by a physics engine to produce physically plausible scene configurations. Our framework preserves strong controllability over fine-grained textual descriptions and numerical parameters. Experimental results show that our method outperforms prior approaches in scene complexity, visual quality, and physical accuracy.
arXiv Detail & Related papers (2026-02-16T17:55:25Z) - AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation [45.753757870577196]
We introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. We show that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior art frequently collapses.
arXiv Detail & Related papers (2026-02-04T15:42:58Z) - sim2art: Accurate Articulated Object Modeling from a Single Video using Synthetic Training Data Only [20.99905717289565]
We present the first data-driven approach that jointly predicts part segmentation and joint parameters from monocular video captured with a freely moving camera. Our method demonstrates strong generalization to real-world objects, offering a scalable and practical solution for articulated object understanding. Our approach operates directly on casually recorded video, making it suitable for real-time applications in dynamic environments.
arXiv Detail & Related papers (2025-12-08T16:38:30Z) - URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model [76.08429266631823]
We propose an end-to-end automatic reconstruction framework based on a 3D multimodal large language model (MLLM). URDF-Anything utilizes an autoregressive prediction framework based on point-cloud and text multimodal input to jointly optimize geometric segmentation and kinematic parameter prediction. Experiments on both simulated and real-world datasets demonstrate that our method significantly outperforms existing approaches.
arXiv Detail & Related papers (2025-11-02T13:45:51Z) - GaussianArt: Unified Modeling of Geometry and Motion for Articulated Objects [4.717906057951389]
We introduce a unified representation that jointly models geometry and motion using articulated 3D Gaussians. This formulation improves robustness in motion decomposition and supports articulated objects with up to 20 parts. We show that our method consistently achieves superior accuracy in part-level geometry reconstruction and motion estimation across a broad range of object types.
arXiv Detail & Related papers (2025-08-20T17:59:08Z) - ScrewSplat: An End-to-End Method for Articulated Object Recognition [11.498029485126045]
ScrewSplat is a simple end-to-end method that operates solely on RGB observations. We demonstrate that our method achieves state-of-the-art recognition accuracy across a diverse set of articulated objects.
arXiv Detail & Related papers (2025-08-04T07:45:31Z) - Guiding Human-Object Interactions with Rich Geometry and Relations [21.528466852204627]
Existing methods often rely on simplified object representations, such as the object's centroid or the nearest point to a human, to achieve physically plausible motions. We introduce ROG, a novel framework that addresses relationships inherent in HOIs with rich geometric detail. We show that ROG significantly outperforms state-of-the-art methods in the realism evaluations and semantic accuracy of synthesized HOIs.
arXiv Detail & Related papers (2025-03-26T02:57:18Z) - Spatial-Temporal Graph Diffusion Policy with Kinematic Modeling for Bimanual Robotic Manipulation [88.83749146867665]
Existing approaches learn a policy to predict a distant next-best end-effector pose. They then compute the corresponding joint rotation angles for motion using inverse kinematics. We propose Kinematics enhanced Spatial-TemporAl gRaph diffuser.
arXiv Detail & Related papers (2025-03-13T17:48:35Z) - LLMPhy: Complex Physical Reasoning Using Large Language Models and World Models [35.01842161084472]
We propose a new physical reasoning task and a dataset, dubbed TraySim. Our task involves predicting the dynamics of several objects on a tray that is given an external impact. We present LLMPhy, a zero-shot black-box optimization framework that leverages the physics knowledge and program synthesis abilities of LLMs. Our results show that the combination of the LLM and the physics engine leads to state-of-the-art zero-shot physical reasoning performance.
arXiv Detail & Related papers (2024-11-12T18:56:58Z) - Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs [53.66070434419739]
Generalizable articulated object manipulation is essential for home-assistant robots.
We propose a kinematic-aware prompting framework that prompts Large Language Models with kinematic knowledge of objects to generate low-level motion waypoints.
Our framework outperforms traditional methods on 8 seen categories and shows powerful zero-shot capability on 8 unseen articulated object categories.
arXiv Detail & Related papers (2023-11-06T03:26:41Z) - Cycle Consistency Driven Object Discovery [75.60399804639403]
We introduce a method that explicitly optimizes the constraint that each object in a scene should be associated with a distinct slot.
By integrating these consistency objectives into various existing slot-based object-centric methods, we showcase substantial improvements in object-discovery performance.
Our results suggest that the proposed approach not only improves object discovery, but also provides richer features for downstream tasks.
arXiv Detail & Related papers (2023-06-03T21:49:06Z) - Occlusion resistant learning of intuitive physics from videos [52.25308231683798]
A key ability for artificial systems is to understand physical interactions between objects and to predict the future outcomes of a situation.
This ability, often referred to as intuitive physics, has recently received attention, and several methods have been proposed to learn these physical rules from video sequences.
arXiv Detail & Related papers (2020-04-30T19:35:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.