Related papers: Probing the 3D Awareness of Visual Foundation Models

Probing the 3D Awareness of Visual Foundation Models

URL: http://arxiv.org/abs/2404.08636v1
Date: Fri, 12 Apr 2024 17:58:04 GMT
Title: Probing the 3D Awareness of Visual Foundation Models
Authors: Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, Varun Jampani,
Abstract summary: We analyze the 3D awareness of visual foundation models. We conduct experiments using task-specific probes and zero-shot inference procedures on frozen features.
Score: 56.68380136809413
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in large-scale pretraining have yielded visual foundation models with strong capabilities. Not only can recent models generalize to arbitrary images for their training task, their intermediate representations are useful for other visual tasks such as detection and segmentation. Given that such models can classify, delineate, and localize objects in 2D, we ask whether they also represent their 3D structure? In this work, we analyze the 3D awareness of visual foundation models. We posit that 3D awareness implies that representations (1) encode the 3D structure of the scene and (2) consistently represent the surface across views. We conduct a series of experiments using task-specific probes and zero-shot inference procedures on frozen features. Our experiments reveal several limitations of the current models. Our code and analysis can be found at https://github.com/mbanani/probe3d.

Related papers

Sparse Multiview Open-Vocabulary 3D Detection [27.57172918603858]
3D object detection has traditionally been solved by training to detect a fixed set of categories.<n>In this work, we investigate open-vocabulary 3D object detection in the challenging yet practical sparse-view setting.<n>Our approach is training-free, relying on pre-trained, off-the-shelf 2D foundation models instead of employing computationally expensive 3D feature fusion.
arXiv Detail & Related papers (2025-09-19T12:22:24Z)
Feat2GS: Probing Visual Foundation Models with Gaussian Splatting [53.86049874165888]
Visual foundation models (VFMs) are trained on extensive datasets but often limited to 2D images. We introduce Feat2GS, which readout 3D Gaussians attributes from VFM features extracted from unposed images. We conduct extensive experiments to probe the 3D awareness of several VFMs, and investigate the ingredients that lead to a 3D aware VFM.
arXiv Detail & Related papers (2024-12-12T18:59:28Z)
Improving 2D Feature Representations by 3D-Aware Fine-Tuning [17.01280751430423]
Current visual foundation models are trained purely on unstructured 2D data. We show that fine-tuning on 3D-aware data improves the quality of emerging semantic features.
arXiv Detail & Related papers (2024-07-29T17:59:21Z)
Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos [15.532504015622159]
Category-level 3D pose estimation is a fundamentally important problem in computer vision and robotics. We tackle the problem of learning to estimate the category-level 3D pose only from casually taken object-centric videos.
arXiv Detail & Related papers (2024-07-05T09:43:05Z)
DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts. Our model is directly trained on extensive noisy and unaligned in-the-wild' 3D assets. We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z)
Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task. Our approach involves making initial predictions of 2D semantic masks using different large vision models. To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
Visual Localization using Imperfect 3D Models from the Internet [54.731309449883284]
This paper studies how imperfections in 3D models affect localization accuracy. We show that 3D models from the Internet show promise as an easy-to-obtain scene representation.
arXiv Detail & Related papers (2023-04-12T16:15:05Z)
RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection [138.2892824662943]
A promising solution is to make better use of the synthetic dataset, which consists of CAD object models, to boost the learning on real datasets. Recent work on 3D pre-training exhibits failure when transfer features learned on synthetic objects to other real-world applications. In this work, we put forward a new method called RandomRooms to accomplish this objective.
arXiv Detail & Related papers (2021-08-17T17:56:12Z)
Leveraging 2D Data to Learn Textured 3D Mesh Generation [33.32377849866736]
We present the first generative model of textured 3D meshes. We train our model to explain a distribution of images by modelling each image as a 3D foreground object. It learns to generate meshes that when rendered, produce images similar to those in its training set.
arXiv Detail & Related papers (2020-04-08T18:00:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.