Differentiable Inverse Graphics for Zero-shot Scene Reconstruction and Robot Grasping
- URL: http://arxiv.org/abs/2602.05029v1
- Date: Wed, 04 Feb 2026 20:33:50 GMT
- Title: Differentiable Inverse Graphics for Zero-shot Scene Reconstruction and Robot Grasping
- Authors: Octavio Arriaga, Proneet Sharma, Jichen Guo, Marc Otto, Siddhant Kadwe, Rebecca Adam,
- Abstract summary: We introduce a differentiable neuro-graphics model that combines neural foundation models with physics-based differentiable rendering to perform zero-shot scene reconstruction and robot grasping. Our approach offers a pathway towards more data-efficient, interpretable, and generalizable robot autonomy in novel environments.
- Score: 0.820984376071696
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Operating effectively in novel real-world environments requires robotic systems to estimate and interact with previously unseen objects. Current state-of-the-art models address this challenge by using large amounts of training data and test-time samples to build black-box scene representations. In this work, we introduce a differentiable neuro-graphics model that combines neural foundation models with physics-based differentiable rendering to perform zero-shot scene reconstruction and robot grasping without relying on any additional 3D data or test-time samples. Our model solves a series of constrained optimization problems to estimate physically consistent scene parameters, such as meshes, lighting conditions, material properties, and 6D poses of previously unseen objects from a single RGBD image and bounding boxes. We evaluated our approach on standard model-free few-shot benchmarks and demonstrated that it outperforms existing algorithms for model-free few-shot pose estimation. Furthermore, we validated the accuracy of our scene reconstructions by applying our algorithm to a zero-shot grasping task. By enabling zero-shot, physically consistent scene reconstruction and grasping without reliance on extensive datasets or test-time sampling, our approach offers a pathway towards more data-efficient, interpretable, and generalizable robot autonomy in novel environments.
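The core idea the abstract describes, recovering scene parameters by gradient descent through a renderer, can be illustrated with a toy sketch. Everything below is an assumption for illustration, not the paper's actual model: a pinhole point projector stands in for the physics-based differentiable renderer, the pose is reduced to a translation, and central finite differences stand in for the automatic differentiation a real differentiable renderer would provide.

```python
def project(points, t, f=500.0):
    """Pinhole 'renderer': project 3D model points shifted by translation t."""
    out = []
    for x, y, z in points:
        zc = z + t[2]
        out.append((f * (x + t[0]) / zc, f * (y + t[1]) / zc))
    return out

def loss(points, t, observed):
    """Sum of squared reprojection errors (a stand-in for a photometric loss)."""
    pred = project(points, t)
    return sum((u - uo) ** 2 + (v - vo) ** 2
               for (u, v), (uo, vo) in zip(pred, observed))

def estimate_translation(points, observed, steps=3000, lr=1e-5, eps=1e-6):
    """Recover the translation by gradient descent on the rendering loss."""
    t = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = []
        for i in range(3):
            tp, tm = list(t), list(t)
            tp[i] += eps
            tm[i] -= eps
            # Central finite difference in place of autodiff.
            grad.append((loss(points, tp, observed)
                         - loss(points, tm, observed)) / (2 * eps))
        t = [ti - lr * gi for ti, gi in zip(t, grad)]
    return t

# Hypothetical model points (metres) and a ground-truth translation.
model = [(0.5, 0.0, 5.0), (-0.5, 0.2, 5.5), (0.0, -0.4, 4.5), (0.3, 0.3, 6.0)]
t_true = (0.1, -0.2, 0.3)
observed = project(model, t_true)

t_est = estimate_translation(model, observed)
```

The paper's pipeline additionally estimates meshes, lighting, and materials under physical constraints; the sketch only shows the shared mechanism, namely that a differentiable forward renderer turns parameter estimation into gradient-based optimization against an observed image.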
Related papers
- Mirage2Matter: A Physically Grounded Gaussian World Model from Video [87.9732484393686]
We present Simulate Anything, a graphics-driven world modeling and simulation framework. Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS). We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target.
arXiv Detail & Related papers (2026-01-24T07:43:57Z)
- Is Single-View Mesh Reconstruction Ready for Robotics? [78.14584238127338]
We evaluate single-view mesh reconstruction models for their potential in enabling instant digital twin creation for real-time planning and dynamics prediction using physics simulators for robotic manipulation. Our findings highlight critical gaps between computer vision advances and robotics needs, guiding future research at this intersection.
arXiv Detail & Related papers (2025-05-23T14:35:56Z)
- ZeroGrasp: Zero-Shot Shape Reconstruction Enabled Robotic Grasping [40.288085021667065]
We introduce ZeroGrasp, a framework that simultaneously performs 3D reconstruction and grasp pose prediction in near real-time. We evaluate ZeroGrasp on the GraspNet-1B benchmark as well as through real-world robot experiments.
arXiv Detail & Related papers (2025-04-15T04:37:39Z)
- Targeted Hard Sample Synthesis Based on Estimated Pose and Occlusion Error for Improved Object Pose Estimation [9.637714330461037]
We propose a novel method of hard example synthesis that is model-agnostic. We demonstrate an improvement in correct detection rate of up to 20% across several ROBI-dataset objects using state-of-the-art pose estimation models.
arXiv Detail & Related papers (2024-12-05T16:00:55Z)
- Zero123-6D: Zero-shot Novel View Synthesis for RGB Category-level 6D Pose Estimation [66.3814684757376]
This work presents Zero123-6D, the first work to demonstrate the utility of Diffusion Model-based novel-view-synthesizers in enhancing RGB 6D pose estimation at category-level.
The method reduces data requirements, removes the need for depth information in the zero-shot category-level 6D pose estimation task, and improves performance, as demonstrated quantitatively through experiments on the CO3D dataset.
arXiv Detail & Related papers (2024-03-21T10:38:18Z)
- Learning Robust Multi-Scale Representation for Neural Radiance Fields from Unposed Images [65.41966114373373]
We present an improved solution to the neural image-based rendering problem in computer vision.
The proposed approach could synthesize a realistic image of the scene from a novel viewpoint at test time.
arXiv Detail & Related papers (2023-11-08T08:18:23Z)
- ZeroPose: CAD-Prompted Zero-shot Object 6D Pose Estimation in Cluttered Scenes [19.993163470302097]
ZeroPose is a novel framework that performs pose estimation following a Discovery-Orientation-Registration (DOR) inference pipeline.
It generalizes to novel objects without requiring model retraining.
It achieves comparable performance with object-specific training methods and outperforms the state-of-the-art zero-shot method with 50x inference speed improvement.
arXiv Detail & Related papers (2023-05-29T07:54:04Z)
- Robust Category-Level 3D Pose Estimation from Synthetic Data [17.247607850702558]
We introduce SyntheticP3D, a new synthetic dataset for object pose estimation generated from CAD models.
We propose a novel approach (CC3D) for training neural mesh models that perform pose estimation via inverse rendering.
arXiv Detail & Related papers (2023-05-25T14:56:03Z)
- Shape, Pose, and Appearance from a Single Image via Bootstrapped Radiance Field Inversion [54.151979979158085]
We introduce a principled end-to-end reconstruction framework for natural images, where accurate ground-truth poses are not available.
We leverage an unconditional 3D-aware generator, to which we apply a hybrid inversion scheme where a model produces a first guess of the solution.
Our framework can de-render an image in as few as 10 steps, enabling its use in practical scenarios.
arXiv Detail & Related papers (2022-11-21T17:42:42Z)
- Neural-Sim: Learning to Generate Training Data with NeRF [31.81496344354997]
We present the first fully differentiable synthetic data pipeline that uses Neural Radiance Fields (NeRFs) in a closed-loop with a target application's loss function.
Our approach generates data on-demand, with no human labor, to maximize accuracy for a target task.
arXiv Detail & Related papers (2022-07-22T22:48:33Z)
- Scene Synthesis via Uncertainty-Driven Attribute Synchronization [52.31834816911887]
This paper introduces a novel neural scene synthesis approach that can capture diverse feature patterns of 3D scenes.
Our method combines the strength of both neural network-based and conventional scene synthesis approaches.
arXiv Detail & Related papers (2021-08-30T19:45:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.