Renderers are Good Zero-Shot Representation Learners: Exploring
Diffusion Latents for Metric Learning
- URL: http://arxiv.org/abs/2306.10721v1
- Date: Mon, 19 Jun 2023 06:41:44 GMT
- Title: Renderers are Good Zero-Shot Representation Learners: Exploring
Diffusion Latents for Metric Learning
- Authors: Michael Tang, David Shustin
- Abstract summary: We use retrieval as a proxy for measuring the metric learning properties of the latent spaces of Shap-E.
We find that Shap-E representations outperform classical EfficientNet baseline representations zero-shot.
These findings give preliminary indication that 3D-based rendering and generative models can yield useful representations for discriminative tasks in our innately 3D-native world.
- Score: 1.0152838128195467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Can the latent spaces of modern generative neural rendering models serve as
representations for 3D-aware discriminative visual understanding tasks? We use
retrieval as a proxy for measuring the metric learning properties of the latent
spaces of Shap-E, including capturing view-independence and enabling the
aggregation of scene representations from the representations of individual
image views. We find that Shap-E representations outperform those of a
classical EfficientNet baseline zero-shot, and are still competitive when both
methods are trained using a contrastive loss. These
findings give preliminary indication that 3D-based rendering and generative
models can yield useful representations for discriminative tasks in our
innately 3D-native world. Our code is available at
\url{https://github.com/michaelwilliamtang/golden-retriever}.
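
To make the retrieval-as-a-proxy setup concrete, below is a minimal sketch (not the authors' code from the repository above) of how per-view embeddings from any encoder, whether Shap-E latents or EfficientNet features, can be mean-pooled into a scene representation and used for zero-shot nearest-neighbor retrieval. The embedding dimension, the mean-pooling aggregation, and the cosine-similarity scoring are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn.functional as F

def aggregate_views(view_embeddings: torch.Tensor) -> torch.Tensor:
    """Mean-pool per-view embeddings (V, D) into one L2-normalized scene embedding (D,)."""
    return F.normalize(view_embeddings.mean(dim=0), dim=-1)

def retrieve(query: torch.Tensor, gallery: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Return indices of the top-k gallery rows by cosine similarity to the query."""
    query = F.normalize(query, dim=-1)
    gallery = F.normalize(gallery, dim=-1)
    return (gallery @ query).topk(k).indices

# Toy usage with random stand-in features (D = 512 is an arbitrary choice):
views = torch.randn(8, 512)          # embeddings of 8 views of one query scene
scene_emb = aggregate_views(views)   # view-aggregated scene representation
gallery = torch.randn(100, 512)      # embeddings of 100 candidate scenes
print(retrieve(scene_emb, gallery, k=5))

In this setup, a representation with good metric-learning properties should place different views of the same scene close together, so the aggregated query retrieves the correct scene without any task-specific training.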
Related papers
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments consistently demonstrates our method's superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z) - SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z) - RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering
Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z) - AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z) - Cross-Dimensional Refined Learning for Real-Time 3D Visual Perception
from Monocular Video [2.2299983745857896]
We present a novel real-time capable learning method that jointly perceives a 3D scene's geometry structure and semantic labels.
We propose an end-to-end cross-dimensional refinement neural network (CDRNet) to extract both 3D mesh and 3D semantic labeling in real time.
arXiv Detail & Related papers (2023-03-16T11:53:29Z) - Spatio-temporal Self-Supervised Representation Learning for 3D Point
Clouds [96.9027094562957]
We introduce a spatio-temporal representation learning (STRL) framework capable of learning from unlabeled 3D point clouds.
Inspired by how infants learn from visual data in the wild, we explore rich cues derived from the 3D data.
STRL takes two temporally related frames from a 3D point cloud sequence as input, transforms them with spatial data augmentation, and learns an invariant representation self-supervisedly (a toy sketch of this two-frame invariance idea appears after this list).
arXiv Detail & Related papers (2021-09-01T04:17:11Z) - Disentangling Semantic-to-visual Confusion for Zero-shot Learning [13.610995960100869]
We develop a novel model called Disentangling Class Representation Generative Adversarial Network (DCR-GAN).
Benefiting from the disentangled representations, DCR-GAN could fit a more realistic distribution over both seen and unseen features.
Our proposed model leads to superior performance over state-of-the-art methods on four benchmark datasets.
arXiv Detail & Related papers (2021-06-16T08:04:11Z) - Image GANs meet Differentiable Rendering for Inverse Graphics and
Interpretable 3D Neural Rendering [101.56891506498755]
Differentiable rendering has paved the way to training neural networks to perform "inverse graphics" tasks.
We show that our approach significantly outperforms state-of-the-art inverse graphics networks trained on existing datasets.
arXiv Detail & Related papers (2020-10-18T22:29:07Z) - Equivariant Neural Rendering [22.95150913645939]
We propose a framework for learning neural scene representations directly from images, without 3D supervision.
Our key insight is that 3D structure can be imposed by ensuring that the learned representation transforms like a real 3D scene.
Our formulation allows us to infer and render scenes in real time while achieving comparable results to models requiring minutes for inference.
arXiv Detail & Related papers (2020-06-13T12:25:07Z)