Differentiable Inverse Graphics for Zero-shot Scene Reconstruction and Robot Grasping
- URL: http://arxiv.org/abs/2602.05029v1
- Date: Wed, 04 Feb 2026 20:33:50 GMT
- Title: Differentiable Inverse Graphics for Zero-shot Scene Reconstruction and Robot Grasping
- Authors: Octavio Arriaga, Proneet Sharma, Jichen Guo, Marc Otto, Siddhant Kadwe, Rebecca Adam,
- Abstract summary: We introduce a differentiable neuro-graphics model that combines neural foundation models with physics-based differentiable rendering to perform zero-shot scene reconstruction and robot grasping. Our approach offers a pathway towards more data-efficient, interpretable, and generalizable robot autonomy in novel environments.
- Score: 0.820984376071696
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Operating effectively in novel real-world environments requires robotic systems to estimate and interact with previously unseen objects. Current state-of-the-art models address this challenge by using large amounts of training data and test-time samples to build black-box scene representations. In this work, we introduce a differentiable neuro-graphics model that combines neural foundation models with physics-based differentiable rendering to perform zero-shot scene reconstruction and robot grasping without relying on any additional 3D data or test-time samples. Our model solves a series of constrained optimization problems to estimate physically consistent scene parameters, such as meshes, lighting conditions, material properties, and 6D poses of previously unseen objects from a single RGBD image and bounding boxes. We evaluated our approach on standard model-free few-shot benchmarks and demonstrated that it outperforms existing algorithms for model-free few-shot pose estimation. Furthermore, we validated the accuracy of our scene reconstructions by applying our algorithm to a zero-shot grasping task. By enabling zero-shot, physically consistent scene reconstruction and grasping without reliance on extensive datasets or test-time sampling, our approach offers a pathway towards more data-efficient, interpretable, and generalizable robot autonomy in novel environments.
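The core idea the abstract describes, recovering scene parameters by gradient descent through a renderer, can be illustrated with a toy sketch. Everything below is an assumption for illustration, not the paper's actual model: a pinhole point projector stands in for the physics-based differentiable renderer, the pose is reduced to a translation, and central finite differences stand in for the automatic differentiation a real differentiable renderer would provide.

```python
def project(points, t, f=500.0):
    """Pinhole 'renderer': project 3D model points shifted by translation t."""
    out = []
    for x, y, z in points:
        zc = z + t[2]
        out.append((f * (x + t[0]) / zc, f * (y + t[1]) / zc))
    return out

def loss(points, t, observed):
    """Sum of squared reprojection errors (a stand-in for a photometric loss)."""
    pred = project(points, t)
    return sum((u - uo) ** 2 + (v - vo) ** 2
               for (u, v), (uo, vo) in zip(pred, observed))

def estimate_translation(points, observed, steps=3000, lr=1e-5, eps=1e-6):
    """Recover the translation by gradient descent on the rendering loss."""
    t = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = []
        for i in range(3):
            tp, tm = list(t), list(t)
            tp[i] += eps
            tm[i] -= eps
            # Central finite difference in place of autodiff.
            grad.append((loss(points, tp, observed)
                         - loss(points, tm, observed)) / (2 * eps))
        t = [ti - lr * gi for ti, gi in zip(t, grad)]
    return t

# Hypothetical model points (metres) and a ground-truth translation.
model = [(0.5, 0.0, 5.0), (-0.5, 0.2, 5.5), (0.0, -0.4, 4.5), (0.3, 0.3, 6.0)]
t_true = (0.1, -0.2, 0.3)
observed = project(model, t_true)

t_est = estimate_translation(model, observed)
```

The paper's pipeline additionally estimates meshes, lighting, and materials under physical constraints; the sketch only shows the shared mechanism, namely that a differentiable forward renderer turns parameter estimation into gradient-based optimization against an observed image.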
Related papers
- Mirage2Matter: A Physically Grounded Gaussian World Model from Video [87.9732484393686]
We present Simulate Anything, a graphics-driven world modeling and simulation framework. Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS). We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target.
arXiv Detail & Related papers (2026-01-24T07:43:57Z)
- Is Single-View Mesh Reconstruction Ready for Robotics? [78.14584238127338]
We evaluate single-view mesh reconstruction models for their potential in enabling instant digital twin creation for real-time planning and dynamics prediction using physics simulators for robotic manipulation. Our findings highlight critical gaps between computer vision advances and robotics needs, guiding future research at this intersection.
arXiv Detail & Related papers (2025-05-23T14:35:56Z)
- ZeroGrasp: Zero-Shot Shape Reconstruction Enabled Robotic Grasping [40.288085021667065]
We introduce ZeroGrasp, a framework that simultaneously performs 3D reconstruction and grasp pose prediction in near real-time. We evaluate ZeroGrasp on the GraspNet-1B benchmark as well as through real-world robot experiments.
arXiv Detail & Related papers (2025-04-15T04:37:39Z)
- Targeted Hard Sample Synthesis Based on Estimated Pose and Occlusion Error for Improved Object Pose Estimation [9.637714330461037]
We propose a novel method of hard example synthesis that is model-agnostic. We demonstrate an improvement in correct detection rate of up to 20% across several ROBI-dataset objects using state-of-the-art pose estimation models.
arXiv Detail & Related papers (2024-12-05T16:00:55Z)
- Zero123-6D: Zero-shot Novel View Synthesis for RGB Category-level 6D Pose Estimation [66.3814684757376]
This work presents Zero123-6D, the first work to demonstrate the utility of Diffusion Model-based novel-view-synthesizers in enhancing RGB 6D pose estimation at category-level.
The method reduces data requirements, removes the need for depth information in the zero-shot category-level 6D pose estimation task, and improves performance, as demonstrated quantitatively through experiments on the CO3D dataset.
arXiv Detail & Related papers (2024-03-21T10:38:18Z)
- Learning Robust Multi-Scale Representation for Neural Radiance Fields from Unposed Images [65.41966114373373]
We present an improved solution to the neural image-based rendering problem in computer vision.
The proposed approach could synthesize a realistic image of the scene from a novel viewpoint at test time.
arXiv Detail & Related papers (2023-11-08T08:18:23Z)
- ZeroPose: CAD-Prompted Zero-shot Object 6D Pose Estimation in Cluttered Scenes [19.993163470302097]
ZeroPose is a novel framework that performs pose estimation following a Discovery-Orientation-Registration (DOR) inference pipeline.
It generalizes to novel objects without requiring model retraining.
It achieves comparable performance with object-specific training methods and outperforms the state-of-the-art zero-shot method with 50x inference speed improvement.
arXiv Detail & Related papers (2023-05-29T07:54:04Z)
- Robust Category-Level 3D Pose Estimation from Synthetic Data [17.247607850702558]
We introduce SyntheticP3D, a new synthetic dataset for object pose estimation generated from CAD models.
We propose a novel approach (CC3D) for training neural mesh models that perform pose estimation via inverse rendering.
arXiv Detail & Related papers (2023-05-25T14:56:03Z)
- Shape, Pose, and Appearance from a Single Image via Bootstrapped Radiance Field Inversion [54.151979979158085]
We introduce a principled end-to-end reconstruction framework for natural images, where accurate ground-truth poses are not available.
We leverage an unconditional 3D-aware generator, to which we apply a hybrid inversion scheme where a model produces a first guess of the solution.
Our framework can de-render an image in as few as 10 steps, enabling its use in practical scenarios.
arXiv Detail & Related papers (2022-11-21T17:42:42Z)
- Neural-Sim: Learning to Generate Training Data with NeRF [31.81496344354997]
We present the first fully differentiable synthetic data pipeline that uses Neural Radiance Fields (NeRFs) in a closed-loop with a target application's loss function.
Our approach generates data on-demand, with no human labor, to maximize accuracy for a target task.
arXiv Detail & Related papers (2022-07-22T22:48:33Z)
- Scene Synthesis via Uncertainty-Driven Attribute Synchronization [52.31834816911887]
This paper introduces a novel neural scene synthesis approach that can capture diverse feature patterns of 3D scenes.
Our method combines the strength of both neural network-based and conventional scene synthesis approaches.
arXiv Detail & Related papers (2021-08-30T19:45:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.