Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation
- URL: http://arxiv.org/abs/2602.06032v2
- Date: Wed, 11 Feb 2026 13:25:58 GMT
- Title: Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation
- Authors: David Shavin, Sagie Benaim,
- Abstract summary: Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks.<n>Despite their effectiveness, they often exhibit a critical lack of 3D awareness.<n>We introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed-forward 3D reconstruction pipeline.
- Score: 10.159228582199924
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks. Despite their effectiveness, they often exhibit a critical lack of 3D awareness. To this end, we introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed-forward 3D reconstruction pipeline. Given 2D features produced by a teacher model, our method first lifts these features into an explicit 3D Gaussian representation, in a feedforward manner. These 3D features are then ``splatted" onto novel viewpoints, producing a set of novel 2D feature maps used to supervise the student model, ``distilling" geometrically grounded knowledge. By replacing slow per-scene optimization of prior work with our feed-forward lifting approach, our framework avoids feature-averaging artifacts, creating a dynamic learning process where the teacher's consistency improves alongside that of the student. We conduct a comprehensive evaluation on a suite of downstream tasks, including monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation. Our method significantly outperforms prior works, not only achieving substantial gains in 3D awareness but also enhancing the underlying semantic richness of 2D features. Project page is available at https://davidshavin4.github.io/Splat-and-Distill/
Related papers
- DINO in the Room: Leveraging 2D Foundation Models for 3D Segmentation [51.43837087865105]
Vision foundation models (VFMs) trained on large-scale image datasets provide high-quality features that have significantly advanced 2D visual recognition.<n>Their potential in 3D vision remains largely untapped, despite the common availability of 2D images alongside 3D point cloud datasets.<n>We introduce DITR, a simple yet effective approach that extracts 2D foundation model features, projects them to 3D, and finally injects them into a 3D point cloud segmentation model.
arXiv Detail & Related papers (2025-03-24T17:59:11Z) - Unifying 2D and 3D Vision-Language Understanding [85.84054120018625]
We introduce UniVLG, a unified architecture for 2D and 3D vision-language learning.<n>UniVLG bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems.
arXiv Detail & Related papers (2025-03-13T17:56:22Z) - A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision [65.33043028101471]
We present a novel framework for training 3D image-conditioned diffusion models using only 2D supervision.<n>Most existing 3D generative models rely on full 3D supervision, which is impractical due to the scarcity of large-scale 3D datasets.
arXiv Detail & Related papers (2024-12-01T00:29:57Z) - Improving 2D Feature Representations by 3D-Aware Fine-Tuning [17.01280751430423]
Current visual foundation models are trained purely on unstructured 2D data.
We show that fine-tuning on 3D-aware data improves the quality of emerging semantic features.
arXiv Detail & Related papers (2024-07-29T17:59:21Z) - Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation [67.36775428466045]
We propose Geometry Guided Self-Distillation (GGSD) to learn superior 3D representations from 2D pre-trained models.
Due to the advantages of 3D representation, the performance of the distilled 3D student model can significantly surpass that of the 2D teacher model.
arXiv Detail & Related papers (2024-07-18T10:13:56Z) - PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [111.16358607889609]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.<n>For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z) - 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that enables us to leverage 3D features extracted from large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during the training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.