VLM-3D: End-to-End Vision-Language Models for Open-World 3D Perception
- URL: http://arxiv.org/abs/2508.09061v1
- Date: Tue, 12 Aug 2025 16:25:27 GMT
- Title: VLM-3D: End-to-End Vision-Language Models for Open-World 3D Perception
- Authors: Fuhao Chang, Shuxin Li, Yabei Li, Lei He
- Abstract summary: We propose VLM-3D, the first end-to-end framework that enables VLMs to perform 3D geometric perception in autonomous driving scenarios. VLM-3D incorporates Low-Rank Adaptation (LoRA) to efficiently adapt VLMs to driving tasks with minimal computational overhead. We show that the proposed joint semantic-geometric loss in VLM-3D leads to a 12.8% improvement in perception accuracy.
- Score: 5.245213543721097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-set perception in complex traffic environments poses a critical challenge for autonomous driving systems, particularly in identifying previously unseen object categories, which is vital for ensuring safety. Visual Language Models (VLMs), with their rich world knowledge and strong semantic reasoning capabilities, offer new possibilities for addressing this task. However, existing approaches typically leverage VLMs to extract visual features and couple them with traditional object detectors, resulting in multi-stage error propagation that hinders perception accuracy. To overcome this limitation, we propose VLM-3D, the first end-to-end framework that enables VLMs to perform 3D geometric perception in autonomous driving scenarios. VLM-3D incorporates Low-Rank Adaptation (LoRA) to efficiently adapt VLMs to driving tasks with minimal computational overhead, and introduces a joint semantic-geometric loss design: token-level semantic loss is applied during early training to ensure stable convergence, while 3D IoU loss is introduced in later stages to refine the accuracy of 3D bounding box predictions. Evaluations on the nuScenes dataset demonstrate that the proposed joint semantic-geometric loss in VLM-3D leads to a 12.8% improvement in perception accuracy, fully validating the effectiveness and advancement of our method.
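The staged loss design is concrete enough to sketch. Below is a minimal, hypothetical PyTorch rendering of the schedule the abstract describes: token-level cross-entropy alone in early training, with a 3D IoU term added in later stages. The function names, the warm-up threshold, and the axis-aligned IoU simplification are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def axis_aligned_iou_3d(pred, target):
    """IoU of axis-aligned 3D boxes given as (x1, y1, z1, x2, y2, z2).

    Simplification: real driving boxes carry a yaw angle; rotation
    handling is omitted here to keep the sketch short.
    """
    lo = torch.maximum(pred[..., :3], target[..., :3])
    hi = torch.minimum(pred[..., 3:], target[..., 3:])
    inter = (hi - lo).clamp(min=0).prod(dim=-1)
    vol_pred = (pred[..., 3:] - pred[..., :3]).clamp(min=0).prod(dim=-1)
    vol_gt = (target[..., 3:] - target[..., :3]).clamp(min=0).prod(dim=-1)
    return inter / (vol_pred + vol_gt - inter + 1e-7)

def joint_semantic_geometric_loss(token_logits, token_targets,
                                  boxes_pred, boxes_gt,
                                  step, warmup_steps=10_000):
    # Early training: token-level semantic loss only, for stable convergence.
    loss = F.cross_entropy(token_logits.flatten(0, -2), token_targets.flatten())
    # Later stages: add a 3D IoU term to refine box geometry.
    if step >= warmup_steps:
        loss = loss + (1.0 - axis_aligned_iou_3d(boxes_pred, boxes_gt)).mean()
    return loss
```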
Related papers
- Spatial-aware Vision Language Model for Autonomous Driving [16.149511148218497]
Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. We propose LVLDrive, a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving.
arXiv Detail & Related papers (2025-12-30T16:35:00Z)
- Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving [48.512353531499286]
We introduce Percept-WAM, a perception-enhanced World-Awareness-Action Model that implicitly integrates 2D/3D scene understanding abilities within a single vision-language model (VLM). We propose a grid-conditioned prediction mechanism for dense object perception, incorporating IoU-aware scoring and parallel autoregressive decoding, improving stability in long-tail, far-range, and small-object scenarios. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on downstream perception benchmarks, achieving 51.7/58.9 mAP on 2D detection and nuScenes BEV 3D detection.
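IoU-aware scoring is a standard detection idea, so a short sketch may help: the final confidence fuses class probability with a predicted box-quality (IoU) estimate. The geometric `alpha` blend below is a common convention in quality-aware detection heads, not Percept-WAM's actual formulation.

```python
import torch

def iou_aware_score(cls_logits, iou_logits, alpha=0.5):
    """cls_logits: (N, C) per-class logits; iou_logits: (N, 1) quality logits.

    Returns (N, C) scores; `alpha` trades classification against localization
    quality and is an illustrative choice, not taken from the paper.
    """
    cls_prob = cls_logits.sigmoid()   # independent per-class scores
    iou_prob = iou_logits.sigmoid()   # predicted IoU with the ground-truth box
    return cls_prob.pow(alpha) * iou_prob.pow(1.0 - alpha)
```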
arXiv Detail & Related papers (2025-11-24T15:28:25Z)
- Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z)
- OccVLA: Vision-Language-Action Model with Implicit 3D Occupancy Supervision [31.929268076595122]
OccVLA is a novel framework that integrates 3D occupancy representations into a unified multimodal reasoning process. OccVLA achieves state-of-the-art results on the nuScenes benchmark for trajectory planning and demonstrates superior performance on 3D visual question-answering tasks.
arXiv Detail & Related papers (2025-09-06T03:47:21Z)
- V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving [2.3302708486956454]
We introduce V3LMA, a novel approach that enhances 3D scene understanding by integrating Large Language Models (LLMs) with LVLMs. V3LMA leverages textual descriptions generated from object detections and video inputs, significantly boosting performance without requiring fine-tuning. Our method improves situational awareness and decision-making in complex traffic scenarios, achieving a score of 0.56 on the LingoQA benchmark.
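Since V3LMA works by textualizing detections rather than fine-tuning, the core idea fits in a few lines. A sketch under assumed data fields (`label`, `distance_m`, and `bearing_deg` are hypothetical names; the paper's actual description format may differ):

```python
def detections_to_prompt(detections, question):
    """Turn 3D detections into text so a frozen LVLM gains geometric context.

    detections: list of dicts with hypothetical keys
    'label', 'distance_m', 'bearing_deg'.
    """
    lines = [
        f"- {d['label']} at {d['distance_m']:.1f} m, bearing {d['bearing_deg']:.0f} deg"
        for d in detections
    ]
    return "Detected objects:\n" + "\n".join(lines) + f"\nQuestion: {question}"

prompt = detections_to_prompt(
    [{"label": "pedestrian", "distance_m": 12.4, "bearing_deg": -15}],
    "Is it safe to turn right?",
)
```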
arXiv Detail & Related papers (2025-04-30T20:00:37Z)
- ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks.
We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation.
Our architecture, named ALOcc, achieves an optimal tradeoff between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z)
- OccLoff: Learning Optimized Feature Fusion for 3D Occupancy Prediction [5.285847977231642]
3D semantic occupancy prediction is crucial for ensuring the safety in autonomous driving.
Existing fusion-based occupancy methods typically involve performing a 2D-to-3D view transformation on image features.
We propose OccLoff, a framework that Learns to optimize Feature Fusion for 3D occupancy prediction.
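The 2D-to-3D view transformation referenced here is commonly an LSS-style lift: image features are spread along camera rays according to a predicted depth distribution. A background sketch with placeholder shapes (OccLoff's learned fusion itself is not reproduced):

```python
import torch

def lift_features(feats, depth_logits):
    """feats: (B, C, H, W) image features; depth_logits: (B, D, H, W) over D bins.

    Returns a (B, C, D, H, W) frustum of depth-weighted features that can be
    splatted into a 3D voxel grid in a later step.
    """
    depth_prob = depth_logits.softmax(dim=1)              # per-pixel depth dist.
    return feats.unsqueeze(2) * depth_prob.unsqueeze(1)   # outer product
```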
arXiv Detail & Related papers (2024-11-06T06:34:27Z)
- LLMI3D: MLLM-based 3D Perception from a Single 2D Image [77.13869413871028]
Multimodal large language models (MLLMs) excel in general capacity but underperform in 3D tasks. In this paper, we propose solutions for weak 3D local spatial object perception, poor text-based geometric numerical output, and inability to handle camera focal variations. We employ parameter-efficient fine-tuning for a pre-trained MLLM and develop LLMI3D, a powerful 3D perception MLLM.
arXiv Detail & Related papers (2024-08-14T10:00:16Z)
- Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving? [66.6886931183372]
We introduce DETR-style 3D perceptrons as 3D tokenizers, connected to the LLM through a one-layer linear projector.
Despite its simplicity, Atlas demonstrates superior performance in both 3D detection and ego planning tasks.
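The one-layer linear projector is simple enough to make concrete. A minimal sketch with placeholder dimensions (`query_dim`, `llm_dim`, and the query count are assumptions, not the paper's actual configuration):

```python
import torch
import torch.nn as nn

class QueryProjector(nn.Module):
    """Map DETR-style 3D query embeddings into the LLM token-embedding space."""

    def __init__(self, query_dim=256, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(query_dim, llm_dim)  # the single linear layer

    def forward(self, queries):        # (B, N, query_dim) 3D perceptron output
        return self.proj(queries)      # (B, N, llm_dim) "3D tokens" for the LLM

tokens = QueryProjector()(torch.randn(2, 900, 256))  # e.g. 900 object queries
```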
arXiv Detail & Related papers (2024-05-28T16:57:44Z)
- Improving 3D Occupancy Prediction through Class-balancing Loss and Multi-scale Representation [7.651064601670273]
3D environment recognition is essential for autonomous driving systems.
Bird's-Eye-View (BEV)-based perception has achieved state-of-the-art performance on this task.
We introduce a novel UNet-like Multi-scale Occupancy Head module to alleviate this issue.
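A class-balancing loss for occupancy prediction typically reweights cross-entropy by inverse class frequency so that rare semantic classes are not drowned out by road and background voxels. A generic sketch of that recipe (the paper's exact weighting scheme may differ):

```python
import torch
import torch.nn.functional as F

def class_balanced_ce(logits, targets, class_counts, eps=1e-6):
    """logits: (N, C) per-voxel logits; targets: (N,); class_counts: (C,)."""
    freq = class_counts.float() / class_counts.sum()
    weights = 1.0 / (freq + eps)                       # rare classes weigh more
    weights = weights * len(weights) / weights.sum()   # normalize to mean ~1
    return F.cross_entropy(logits, targets, weight=weights)
```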
arXiv Detail & Related papers (2024-05-25T07:13:13Z)
- FILP-3D: Enhancing 3D Few-shot Class-incremental Learning with Pre-trained Vision-Language Models [59.13757801286343]
Few-shot class-incremental learning aims to mitigate the catastrophic forgetting issue when a model is incrementally trained on limited data. We introduce the FILP-3D framework with two novel components: the Redundant Feature Eliminator (RFE) for feature space misalignment and the Spatial Noise Compensator (SNC) for significant noise.
arXiv Detail & Related papers (2023-12-28T14:52:07Z)
- 3D Harmonic Loss: Towards Task-consistent and Time-friendly 3D Object Detection on Edge for Intelligent Transportation System [28.55894241049706]
We propose a 3D harmonic loss function to mitigate inconsistent predictions in point-cloud-based detection.
Our proposed method considerably improves performance over benchmark models.
Our code is open-source and publicly available.
arXiv Detail & Related papers (2022-11-07T10:11:48Z)
- Unsupervised Domain Adaptation for Monocular 3D Object Detection via Self-Training [57.25828870799331]
We propose STMono3D, a new self-teaching framework for unsupervised domain adaptation on Mono3D.
We develop a teacher-student paradigm to generate adaptive pseudo labels on the target domain.
STMono3D achieves remarkable performance on all evaluated datasets and even surpasses fully supervised results on the KITTI 3D object detection dataset.
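The teacher-student paradigm the summary describes follows a familiar self-training pattern: an EMA-updated teacher pseudo-labels the unlabeled target domain for the student. A generic sketch (the momentum, threshold, and prediction format are illustrative assumptions, not STMono3D's specifics):

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Teacher weights track the student as an exponential moving average."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

@torch.no_grad()
def pseudo_label(teacher, images, score_thresh=0.7):
    """Keep only confident teacher detections as targets for the student."""
    preds = teacher(images)  # assumed: list of dicts with 'scores', 'boxes_3d'
    return [p["boxes_3d"][p["scores"] > score_thresh] for p in preds]

# teacher = copy.deepcopy(student) at initialization, then per training step:
# loss = student_loss(student(images), pseudo_label(teacher, images))
# ema_update(teacher, student)
```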
arXiv Detail & Related papers (2022-04-25T12:23:07Z)