3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation
- URL: http://arxiv.org/abs/2506.09883v1
- Date: Wed, 11 Jun 2025 15:56:59 GMT
- Title: 3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation
- Authors: Seonho Lee, Jiho Choi, Inha Kang, Jiwook Kim, Junsung Park, Hyunjung Shim,
- Abstract summary: Vision-Language Models (VLMs) have shown remarkable performance on diverse visual and linguistic tasks.<n>We propose Geometric Distillation, a framework that injects human-inspired geometric cues into pretrained VLMs.<n>Our method shapes representations to be geometry-aware while remaining compatible with natural image-text inputs.
- Score: 17.294440057314812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Models (VLMs) have shown remarkable performance on diverse visual and linguistic tasks, yet they remain fundamentally limited in their understanding of 3D spatial structures. We propose Geometric Distillation, a lightweight, annotation-free fine-tuning framework that injects human-inspired geometric cues into pretrained VLMs without modifying their architecture. By distilling (1) sparse correspondences, (2) relative depth relations, and (3) dense cost volumes from off-the-shelf 3D foundation models (e.g., MASt3R, VGGT), our method shapes representations to be geometry-aware while remaining compatible with natural image-text inputs. Through extensive evaluations on 3D vision-language reasoning and 3D perception benchmarks, our method consistently outperforms prior approaches, achieving improved 3D spatial reasoning with significantly lower computational cost. Our work demonstrates a scalable and efficient path to bridge 2D-trained VLMs with 3D understanding, opening up wider use in spatially grounded multimodal tasks.
Related papers
- Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding [8.090058633054852]
We introduce a plug-and-play module that implicitly injects 3D geometry features into Vision-Language-Action (VLA) models.<n>Our method significantly improves the performance of state-of-the-art VLA models across diverse scenarios.
arXiv Detail & Related papers (2025-07-01T04:05:47Z) - VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction [86.82819259860186]
We introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning.<n>VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding.
arXiv Detail & Related papers (2025-05-26T17:56:30Z) - Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding [58.38294408121273]
We propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding dubbed CUA-O3D.<n>Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially-aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties.
arXiv Detail & Related papers (2025-03-20T20:58:48Z) - Unifying 2D and 3D Vision-Language Understanding [85.84054120018625]
We introduce UniVLG, a unified architecture for 2D and 3D vision-language learning.<n>UniVLG bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems.
arXiv Detail & Related papers (2025-03-13T17:56:22Z) - Learning A Zero-shot Occupancy Network from Vision Foundation Models via Self-supervised Adaptation [41.98740330990215]
This work proposes a novel approach that bridges 2D vision foundation models with 3D tasks.<n>We leverage the zero-shot capabilities of vision-language models for image semantics.<n>We project the semantics into 3D space using the reconstructed metric depth, thereby providing 3D supervision.
arXiv Detail & Related papers (2025-03-10T09:54:40Z) - MOSE: Monocular Semantic Reconstruction Using NeRF-Lifted Noisy Priors [11.118490283303407]
We propose a neural field semantic reconstruction approach to lift inferred image-level noisy priors to 3D.
Our method produces accurate semantics and geometry in both 3D and 2D space.
arXiv Detail & Related papers (2024-09-21T05:12:13Z) - LLMI3D: MLLM-based 3D Perception from a Single 2D Image [77.13869413871028]
multimodal large language models (MLLMs) excel in general capacity but underperform in 3D tasks.<n>In this paper, we propose solutions for weak 3D local spatial object perception, poor text-based geometric numerical output, and inability to handle camera focal variations.<n>We employ parameter-efficient fine-tuning for a pre-trained MLLM and develop LLMI3D, a powerful 3D perception MLLM.
arXiv Detail & Related papers (2024-08-14T10:00:16Z) - PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [111.16358607889609]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.<n>For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z) - MvDeCor: Multi-view Dense Correspondence Learning for Fine-grained 3D
Segmentation [91.6658845016214]
We propose to utilize self-supervised techniques in the 2D domain for fine-grained 3D shape segmentation tasks.
We render a 3D shape from multiple views, and set up a dense correspondence learning task within the contrastive learning framework.
As a result, the learned 2D representations are view-invariant and geometrically consistent.
arXiv Detail & Related papers (2022-08-18T00:48:15Z) - Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection [70.71934539556916]
We learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection.
Specifically, a principled geometry formula with projective modeling of 2D and 3D depth predictions in the monocular 3D object detection network is devised.
Our method remarkably improves the detection performance of the state-of-the-art monocular-based method without extra data by 2.80% on the moderate test setting.
arXiv Detail & Related papers (2021-07-29T12:30:39Z) - Translational Symmetry-Aware Facade Parsing for 3D Building
Reconstruction [11.263458202880038]
In this paper, we present a novel translational symmetry-based approach to improving the deep neural networks.
We propose a novel scheme to fuse anchor-free detection in a single stage network, which enables the efficient training and better convergence.
We employ an off-the-shelf rendering engine like Blender to reconstruct the realistic high-quality 3D models using procedural modeling.
arXiv Detail & Related papers (2021-06-02T03:10:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.