MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding
- URL: http://arxiv.org/abs/2512.12307v1
- Date: Sat, 13 Dec 2025 12:26:57 GMT
- Title: MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding
- Authors: Benjamin Beilharz, Thomas S. A. Wallis
- Abstract summary: We introduce MRD (metamers rendered differentiably), an approach that uses physically based differentiable rendering to probe vision models' implicit understanding of generative 3D scene properties. Unlike previous pixel-based methods for evaluating model representations, these reconstruction results are always grounded in physical scene descriptions.
- Score: 0.8123748648559669
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While deep learning methods have achieved impressive success in many vision benchmarks, it remains difficult to understand and explain the representations and decisions of these models. Though vision models are typically trained on 2D inputs, they are often assumed to develop an implicit representation of the underlying 3D scene (for example, showing tolerance to partial occlusion, or the ability to reason about relative depth). Here, we introduce MRD (metamers rendered differentiably), an approach that uses physically based differentiable rendering to probe vision models' implicit understanding of generative 3D scene properties, by finding 3D scene parameters that are physically different but produce the same model activation (i.e. are model metamers). Unlike previous pixel-based methods for evaluating model representations, these reconstruction results are always grounded in physical scene descriptions. This means we can, for example, probe a model's sensitivity to object shape while holding material and lighting constant. As a proof-of-principle, we assess multiple models in their ability to recover scene parameters of geometry (shape) and bidirectional reflectance distribution function (material). The results show high similarity in model activation between target and optimized scenes, with varying visual results. Qualitatively, these reconstructions help investigate the physical scene attributes to which models are sensitive or invariant. MRD holds promise for advancing our understanding of both computer and human vision by enabling analysis of how physical scene parameters drive changes in model responses.
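The optimization loop at the heart of this idea can be sketched in a few lines. The code below is a minimal, self-contained illustration rather than the authors' implementation: a toy differentiable Lambertian-sphere renderer stands in for a physically based renderer (the paper's setting suggests a full path tracer such as Mitsuba 3), an untrained ResNet-18 stands in for the probed vision model, and `render_sphere`, the parameter set (radius, albedo), and the activation-matching loss are all illustrative assumptions.
```python
# Sketch of metamer search via a differentiable renderer (illustrative only).
import torch
import torch.nn.functional as F
import torchvision

def render_sphere(radius, albedo, light_dir, res=224):
    """Differentiable orthographic rendering of a Lambertian sphere."""
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, res), torch.linspace(-1, 1, res), indexing="ij")
    r2 = xs**2 + ys**2
    mask = torch.sigmoid((radius**2 - r2) / 1e-2)         # soft silhouette
    z = torch.sqrt(torch.clamp(radius**2 - r2, min=1e-6))
    normals = torch.stack([xs, ys, z]) / radius           # unit normals on the sphere
    shading = (normals * light_dir.view(3, 1, 1)).sum(0).clamp(min=0)
    return (albedo.view(3, 1, 1) * shading * mask).unsqueeze(0)  # (1,3,H,W)

# Frozen feature extractor (pretrained weights would be used in practice).
backbone = torchvision.models.resnet18(weights=None).eval()
features = torch.nn.Sequential(*list(backbone.children())[:-2])
for p in features.parameters():
    p.requires_grad_(False)

light = torch.tensor([0.3, 0.3, 0.9])
light = light / light.norm()

# Target scene: fixed "true" physical parameters.
with torch.no_grad():
    target = features(render_sphere(torch.tensor(0.8),
                                    torch.tensor([0.2, 0.5, 0.8]), light))

# Optimize physically meaningful parameters so activations match the target.
radius = torch.tensor(0.5, requires_grad=True)
albedo = torch.tensor([0.5, 0.5, 0.5], requires_grad=True)
opt = torch.optim.Adam([radius, albedo], lr=1e-2)
for step in range(500):
    opt.zero_grad()
    loss = F.mse_loss(features(render_sphere(radius, albedo, light)), target)
    loss.backward()
    opt.step()
# `radius`/`albedo` may end up physically different from the target scene while
# producing near-identical activations: a model metamer grounded in scene space.
```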
Related papers
- Vision-Language Embodiment for Monocular Depth Estimation [11.737279515161505]
Current depth estimation models rely on inter-image relationships for supervised training. We propose a method that embodies the camera model and its physical characteristics into a deep learning model. The model can calculate embodied scene depth in real-time based on immediate environmental changes (a geometric sketch of this idea follows this entry).
arXiv Detail & Related papers (2025-03-18T18:05:16Z)
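The summary gives no implementation details, but the notion of embodying a camera's physical characteristics can be illustrated with basic pinhole geometry. The helper below is hypothetical (not the paper's model): it computes, per image row, the depth at which a ray from a camera of known mounting height and pitch meets the ground plane.
```python
import numpy as np

def ground_plane_depth(rows, fy, cy, cam_height, cam_pitch):
    """Hypothetical helper: horizontal distance to the ground plane for each
    image row, given a pinhole camera mounted `cam_height` metres above the
    ground and tilted down by `cam_pitch` radians. Illustration only."""
    angle_below_horizon = np.arctan2(rows - cy, fy) + cam_pitch
    with np.errstate(divide="ignore"):
        depth = cam_height / np.tan(angle_below_horizon)
    # rows at or above the horizon never intersect the ground plane
    return np.where(angle_below_horizon > 0, depth, np.inf)

# e.g. a 480-row image, f_y = 500 px, c_y = 240, camera 1.5 m high, 5 deg pitch
rows = np.arange(480, dtype=np.float64)
depth = ground_plane_depth(rows, fy=500.0, cy=240.0,
                           cam_height=1.5, cam_pitch=np.deg2rad(5.0))
```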
- Probing the 3D Awareness of Visual Foundation Models [56.68380136809413]
We analyze the 3D awareness of visual foundation models.
We conduct experiments using task-specific probes and zero-shot inference procedures on frozen features (a minimal probe sketch follows this entry).
arXiv Detail & Related papers (2024-04-12T17:58:04Z)
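As a concrete illustration of the probing recipe on frozen features, here is a minimal sketch of a dense linear probe for depth; the ResNet-50 backbone, 1x1-conv head, and L1 loss are assumptions, not necessarily the paper's exact choices.
```python
import torch
import torch.nn as nn
import torchvision

# Frozen backbone; only the 1x1-conv probe is trained (assumed setup).
backbone = torchvision.models.resnet50(weights=None)
features = nn.Sequential(*list(backbone.children())[:-2]).eval()
for p in features.parameters():
    p.requires_grad_(False)

probe = nn.Conv2d(2048, 1, kernel_size=1)   # linear depth readout
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(images, depth_targets):
    """One training step: images (B,3,H,W), depth_targets (B,1,H,W)."""
    with torch.no_grad():
        feats = features(images)            # (B,2048,H/32,W/32)
    pred = nn.functional.interpolate(
        probe(feats), size=depth_targets.shape[-2:],
        mode="bilinear", align_corners=False)
    loss = nn.functional.l1_loss(pred, depth_targets)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```
How well such a frozen-feature probe performs, relative to a fully trained model, is what quantifies the backbone's 3D awareness.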
- Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives [70.32817882783608]
We present an approach that produces a simple, compact, and actionable 3D world representation by means of 3D primitives.
Unlike existing primitive decomposition methods that rely on 3D input data, our approach operates directly on images.
We show that the resulting textured primitives faithfully reconstruct the input images and accurately model the visible 3D points.
arXiv Detail & Related papers (2023-07-11T17:58:31Z)
- Neural Groundplans: Persistent Neural Scene Representations from a Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations (a sketch of ground-plane feature pooling follows this entry).
arXiv Detail & Related papers (2022-07-22T17:41:24Z)
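One common way to realize such a ground-aligned representation is to pool image features onto a bird's-eye-view grid. The sketch below is an assumption-laden illustration (names, coordinate conventions, and the sampling scheme are not taken from the paper): it projects a grid of ground-plane points into the image and samples backbone features at those locations.
```python
import torch
import torch.nn.functional as F

def image_to_groundplan(feats, K, cam_height, grid_size=64, extent=10.0):
    """Sample image features at the projections of a ground-plane grid.
    feats: (1,C,Hf,Wf) backbone features; K: 3x3 intrinsics scaled to that
    feature map; camera looks along +z (y points down), so the ground plane
    sits at y = +cam_height. Illustrative only."""
    xs = torch.linspace(-extent, extent, grid_size)
    zs = torch.linspace(0.5, extent, grid_size)
    gx, gz = torch.meshgrid(xs, zs, indexing="ij")
    pts = torch.stack([gx, torch.full_like(gx, cam_height), gz], -1)
    uvw = pts @ K.T                             # pinhole projection
    uv = uvw[..., :2] / uvw[..., 2:3]           # pixel coordinates
    _, _, Hf, Wf = feats.shape                  # normalize to [-1,1] for grid_sample
    norm = torch.stack([uv[..., 0] / (Wf - 1), uv[..., 1] / (Hf - 1)], -1) * 2 - 1
    return F.grid_sample(feats, norm.unsqueeze(0), align_corners=True)  # (1,C,G,G)
```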
- Towards 3D Scene Understanding by Referring Synthetic Models [65.74211112607315]
Methods typically rely on extensive annotations of real scene scans.
We explore how synthetic models can help by mapping synthetic and real scene features to a unified feature space.
Experiments show that our method achieves an average mAP of 46.08% and 55.49% on the ScanNet and S3DIS datasets, respectively.
arXiv Detail & Related papers (2022-03-20T13:06:15Z)
- 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks (a schematic sketch follows this entry).
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
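The control recipe described above, a dynamics model over a learned representation, can be sketched schematically. Everything below (dimensions, the residual MLP, the random-shooting planner) is an assumed stand-in, not the paper's architecture.
```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Predicts the next scene representation from the current one + action."""
    def __init__(self, state_dim=256, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 512), nn.ReLU(),
            nn.Linear(512, state_dim))
    def forward(self, state, action):
        return state + self.net(torch.cat([state, action], dim=-1))  # residual step

def plan(dynamics, state, goal, horizon=10, candidates=256, action_dim=4):
    """Random-shooting MPC: return the first action of the best sampled
    sequence. `state` and `goal` are (1, state_dim) scene representations."""
    actions = torch.randn(candidates, horizon, action_dim)
    s = state.expand(candidates, -1)
    for t in range(horizon):
        s = dynamics(s, actions[:, t])          # roll out in latent space
    cost = (s - goal).pow(2).sum(-1)            # distance to goal representation
    return actions[cost.argmin(), 0]
```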
- Weakly Supervised Learning of Multi-Object 3D Scene Decompositions Using Deep Shape Priors [69.02332607843569]
PriSMONet is a novel approach for learning Multi-Object 3D scene decomposition and representations from single images.
A recurrent encoder regresses a latent representation of 3D shape, pose and texture of each object from an input RGB image (a sketch of this encoder pattern follows this entry).
We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out benefits of the learned representation.
arXiv Detail & Related papers (2020-10-08T14:49:23Z)
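The recurrent per-object encoding pattern can be sketched as follows; layer sizes, the GRU cell, and the pose parameterization (translation plus quaternion) are illustrative assumptions, not PriSMONet's exact design.
```python
import torch
import torch.nn as nn

class RecurrentObjectEncoder(nn.Module):
    """Sketch: one recurrence step per object, each emitting
    (shape, pose, texture) latents from a shared image feature."""
    def __init__(self, feat_dim=512, hid=256, z_shape=64, z_tex=32, n_obj=4):
        super().__init__()
        self.n_obj = n_obj
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        self.gru = nn.GRUCell(feat_dim, hid)
        self.shape_head = nn.Linear(hid, z_shape)
        self.pose_head = nn.Linear(hid, 7)      # translation + quaternion
        self.tex_head = nn.Linear(hid, z_tex)

    def forward(self, image):
        feat = self.cnn(image)                  # (B, feat_dim)
        h = feat.new_zeros(image.size(0), self.gru.hidden_size)
        objects = []
        for _ in range(self.n_obj):             # one object per step
            h = self.gru(feat, h)
            objects.append((self.shape_head(h), self.pose_head(h),
                            self.tex_head(h)))
        return objects
```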
- Equivariant Neural Rendering [22.95150913645939]
We propose a framework for learning neural scene representations directly from images, without 3D supervision.
Our key insight is that 3D structure can be imposed by ensuring that the learned representation transforms like a real 3D scene (a toy sketch of this constraint follows this entry).
Our formulation allows us to infer and render scenes in real time while achieving results comparable to models that require minutes for inference.
arXiv Detail & Related papers (2020-06-13T12:25:07Z)
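The stated key insight, that the learned representation should transform like a real 3D scene, can be written down as an equivariance loss. The toy sketch below fixes 16x16 images and a cubic feature volume so that a 90-degree scene rotation acts exactly on the volume; the architecture is an assumption, not the paper's.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy setup: 16x16 images and an 8-channel, 16-deep feature volume, so a 90
# degree scene rotation maps the cubic volume exactly onto itself.
encoder = nn.Conv2d(3, 8 * 16, 3, padding=1)     # (B, 8*16, 16, 16)
decoder = nn.Conv2d(8 * 16, 3, 3, padding=1)

def equivariance_loss(img_a, img_b):
    """img_b shows img_a's scene rotated 90 degrees about the vertical axis."""
    b, _, h, w = img_a.shape
    vol = encoder(img_a).view(b, 8, 16, h, w)    # (B, C, depth, H, W)
    vol_rot = torch.rot90(vol, k=1, dims=(2, 4)) # rotate the depth-width plane
    pred_b = decoder(vol_rot.reshape(b, 8 * 16, h, w))
    return F.mse_loss(pred_b, img_b)

img_a, img_b = torch.rand(2, 3, 16, 16), torch.rand(2, 3, 16, 16)
loss = equivariance_loss(img_a, img_b)           # drives encoder/decoder training
```
Minimizing this loss over pairs of views forces the latent volume to behave like the 3D scene it represents, which is the equivariance constraint the abstract describes.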