Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving?
- URL: http://arxiv.org/abs/2405.18361v1
- Date: Tue, 28 May 2024 16:57:44 GMT
- Title: Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving?
- Authors: Yifan Bai, Dongming Wu, Yingfei Liu, Fan Jia, Weixin Mao, Ziheng Zhang, Yucheng Zhao, Jianbing Shen, Xing Wei, Tiancai Wang, Xiangyu Zhang,
- Abstract summary: We introduce DETR-style 3D perceptrons as 3D tokenizers, which connect to the LLM through a one-layer linear projector.
Despite its simplicity, Atlas demonstrates superior performance in both 3D detection and ego planning tasks.
- Score: 66.6886931183372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Rapid advancements in Autonomous Driving (AD) have driven a significant shift toward end-to-end approaches, particularly the use of vision-language models (VLMs) that integrate robust logical reasoning and cognitive abilities to enable comprehensive end-to-end planning. However, these VLM-based approaches tend to combine 2D vision tokenizers with a large language model (LLM) for ego-car planning, and thus lack the 3D geometric priors that are a cornerstone of reliable planning. Naturally, this observation raises a critical concern: Can a 2D-tokenized LLM accurately perceive the 3D environment? Our evaluation of current VLM-based methods across 3D object detection, vectorized map construction, and environmental captioning suggests that the answer is, unfortunately, NO. In other words, a 2D-tokenized LLM fails to provide reliable autonomous driving. In response, we introduce DETR-style 3D perceptrons as 3D tokenizers, which connect to the LLM through a one-layer linear projector. This simple yet elegant strategy, termed Atlas, harnesses the inherent priors of the 3D physical world, enabling it to simultaneously process high-resolution multi-view images and perform spatiotemporal modeling. Despite its simplicity, Atlas demonstrates superior performance in both 3D detection and ego-planning tasks on the nuScenes dataset, proving that a 3D-tokenized LLM is the key to reliable autonomous driving. The code and datasets will be released.
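To make the architecture concrete, the following is a minimal, hypothetical PyTorch sketch of a 3D-tokenized LLM front end in the spirit described above: a DETR-style 3D perceptron produces per-query features, and a single linear layer projects them into the LLM's embedding space. Module names, dimensions, and the perceptron interface are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ThreeDTokenizer(nn.Module):
    """Hypothetical sketch: DETR-style 3D perceptron queries -> LLM tokens."""

    def __init__(self, perceptron: nn.Module, query_dim: int = 256, llm_dim: int = 4096):
        super().__init__()
        # `perceptron` is assumed to be a DETR-style 3D detector that returns
        # per-query features of shape (batch, num_queries, query_dim).
        self.perceptron = perceptron
        # The "one-layer linear projector" mapping 3D query features into the
        # LLM's token-embedding space.
        self.projector = nn.Linear(query_dim, llm_dim)

    def forward(self, multi_view_images: torch.Tensor) -> torch.Tensor:
        queries_3d = self.perceptron(multi_view_images)   # (B, Q, query_dim)
        return self.projector(queries_3d)                 # (B, Q, llm_dim) 3D tokens

# Usage sketch: prepend the 3D tokens to the text embeddings before the LLM.
# tokens_3d = tokenizer(images)                            # (B, Q, llm_dim)
# llm_inputs = torch.cat([tokens_3d, text_embeddings], 1)  # fed to a decoder-only LLM
```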
Related papers
- Spatial-aware Vision Language Model for Autonomous Driving [16.149511148218497]
Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models.
Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies.
We propose LVLDrive, a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving.
arXiv Detail & Related papers (2025-12-30T16:35:00Z)
- D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation [66.7166217399105]
Embodied agents face a critical dilemma: end-to-end models lack interpretability and explicit 3D reasoning.
Our model introduces two key innovations: 1) a Dynamic 3D Chain-of-Thought (3D CoT) that unifies planning, grounding, navigation, and question answering within a single 3D-VLM and CoT pipeline; 2) a Synergistic Learning from Fragmented Supervision (SLFS) strategy, which uses a masked autoregressive loss to learn from massive, partially annotated hybrid data.
arXiv Detail & Related papers (2025-12-14T09:53:15Z)
- LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight [105.9472902251177]
We present a VLM-native recipe that casts 3D detection as a next-token prediction problem.
Our model achieves state-of-the-art results, with 49.89 AP_3D, surpassing the previous best by an absolute improvement of +15.51.
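As a rough illustration of how 3D detection can be cast as next-token prediction, the sketch below quantizes a 3D box into discrete token ids that a decoder-only model could emit sequentially. The bin count and coordinate ranges are illustrative assumptions, not the paper's actual vocabulary; this is a sketch, not LocateAnything3D's implementation.

```python
import torch

def box_to_tokens(box, num_bins: int = 1000,
                  lo=(-55.0, -55.0, -5.0, 0.0, 0.0, 0.0, -3.1416),
                  hi=(55.0, 55.0, 3.0, 15.0, 15.0, 8.0, 3.1416)) -> torch.Tensor:
    """Hypothetical sketch: quantize a 3D box (x, y, z, w, l, h, yaw) into
    integer bins so a language model can predict it token by token."""
    lo_t, hi_t = torch.tensor(lo), torch.tensor(hi)
    normed = (torch.as_tensor(box, dtype=torch.float32) - lo_t) / (hi_t - lo_t)
    return (normed.clamp(0.0, 1.0) * (num_bins - 1)).round().long()  # (7,) token ids

# Example: box_to_tokens([12.3, -4.1, 0.2, 1.9, 4.5, 1.6, 0.35])
```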
arXiv Detail & Related papers (2025-11-25T18:59:45Z)
- Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding.
We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs.
Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z)
- Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy [4.1703677379815565]
We propose Vid-LLM, a video-based 3D-MLLM that directly processes video inputs without requiring external 3D data.
In our method, the geometric priors are used directly to improve scene perception.
Experiments across diverse benchmarks verify the effectiveness of our method on 3D Question Answering, 3D Captioning, and 3D Visual Grounding tasks.
arXiv Detail & Related papers (2025-09-29T07:34:18Z)
- OccVLA: Vision-Language-Action Model with Implicit 3D Occupancy Supervision [31.929268076595122]
OccVLA is a novel framework that integrates 3D occupancy representations into a unified multimodal reasoning process.
OccVLA achieves state-of-the-art results on the nuScenes benchmark for trajectory planning and demonstrates superior performance on 3D visual question-answering tasks.
arXiv Detail & Related papers (2025-09-06T03:47:21Z)
- VLM-3D: End-to-End Vision-Language Models for Open-World 3D Perception [5.245213543721097]
We propose VLM-3D, the first end-to-end framework that enables 3D geometric perception in autonomous driving scenarios.
VLM-3D incorporates Low-Rank Adaptation (LoRA) to efficiently adapt VLMs to driving tasks with minimal computational overhead.
We show that the proposed joint semantic-geometric loss in VLM-3D leads to a 12.8% improvement in perception accuracy.
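For readers unfamiliar with LoRA, here is a minimal sketch of a LoRA-augmented linear layer (frozen base weight plus a trainable low-rank update); the rank and scaling values are illustrative, not VLM-3D's actual configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: y = W_frozen x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # freeze the pretrained weight
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)           # low-rank update starts at zero
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

# Usage sketch (hypothetical attribute name): wrap an attention projection of a pretrained VLM.
# layer = LoRALinear(model.q_proj, rank=8)
```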
arXiv Detail & Related papers (2025-08-12T16:25:27Z)
- Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs [72.11701578308804]
This paper categorizes recent 3D Vision-Language Models into 3D object-centric, 2D image-based, and 3D scene-centric approaches.
Despite the architectural similarity of 3D scene-centric VLMs to their 2D counterparts, they have exhibited lower performance than the latest 3D object-centric and 2D image-based approaches.
Our investigation suggests that while these models possess cross-modal alignment capabilities, they tend to over-rely on linguistic cues and overfit to frequent answer distributions.
arXiv Detail & Related papers (2025-06-05T17:56:12Z)
- VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction [86.82819259860186]
We introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning.
VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding.
arXiv Detail & Related papers (2025-05-26T17:56:30Z)
- Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving [45.82124136705798]
DriveMonkey is a framework that seamlessly integrates Large Visual-Language Models with a spatial processor.
Our experiments show that DriveMonkey outperforms general LVLMs, notably achieving a 9.86% improvement on the 3D visual grounding task.
arXiv Detail & Related papers (2025-05-13T16:36:51Z)
- Empowering Large Language Models with 3D Situation Awareness [84.12071023036636]
A key difference between 3D and 2D is that the situation of an egocentric observer in 3D scenes can change, resulting in different descriptions.
We propose a novel approach to automatically generate a situation-aware dataset by leveraging the scanning trajectory during data collection.
We introduce a situation grounding module to explicitly predict the position and orientation of the observer's viewpoint, thereby enabling LLMs to ground situation descriptions in 3D scenes.
arXiv Detail & Related papers (2025-03-29T09:34:16Z)
- CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation [12.406655155106424]
CleverDistiller is a self-supervised, cross-modal 2D-to-3D knowledge distillation (KD) framework.
It achieves state-of-the-art performance in both semantic segmentation and 3D object detection, with gains of up to 10% mIoU.
arXiv Detail & Related papers (2025-03-12T22:18:29Z)
- 3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning [2.6670748466660523]
Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks.
VLMs lack robust 3D scene localization capabilities, limiting their effectiveness in fine-grained robotic operations.
We propose a novel framework that integrates a 2D prompt synthesis module by mapping 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs.
arXiv Detail & Related papers (2025-02-13T02:40:19Z)
- SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE [28.597376637565123]
This paper introduces Scale AutoRegressive 3D (SAR3D), a novel framework that leverages a multi-scale 3D vector-quantized variational autoencoder (VQVAE) to tokenize 3D objects.
By predicting the next scale in a multi-scale latent representation instead of the next single token, SAR3D reduces generation time significantly.
Our experiments show that SAR3D surpasses current 3D generation methods in both speed and quality.
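A minimal, hypothetical sketch of the next-scale idea: condition on the discrete codes of all coarser scales and predict every code of the next scale in a single forward pass, rather than one token at a time. Layer sizes and the placeholder-query design are assumptions, not SAR3D's actual architecture.

```python
import torch
import torch.nn as nn

class NextScalePredictor(nn.Module):
    """Hypothetical sketch: predict all tokens of the next scale at once."""

    def __init__(self, vocab_size: int = 1024, dim: int = 512, num_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, coarse_tokens: torch.Tensor, next_scale_len: int) -> torch.Tensor:
        # coarse_tokens: (B, L) discrete codes from all previously generated scales.
        context = self.embed(coarse_tokens)                              # (B, L, dim)
        queries = torch.zeros(context.size(0), next_scale_len, context.size(2),
                              device=context.device)                     # next-scale placeholders
        hidden = self.backbone(torch.cat([context, queries], dim=1))
        return self.head(hidden[:, -next_scale_len:])                    # (B, next_scale_len, vocab)
```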
arXiv Detail & Related papers (2024-11-25T19:00:05Z)
- VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding [57.04804711488706]
3D visual grounding is crucial for robots, requiring integration of natural language and 3D scene understanding.
We present VLM-Grounder, a novel framework using vision-language models (VLMs) for zero-shot 3D visual grounding based solely on 2D images.
arXiv Detail & Related papers (2024-10-17T17:59:55Z)
- LLMI3D: Empowering LLM with 3D Perception from a Single 2D Image [72.14973729674995]
Current 3D perception methods, particularly small models, struggle with processing logical reasoning, question-answering, and handling open scenario categories.
We propose solutions: Spatial-Enhanced Local Feature Mining for better spatial feature extraction, 3D Query Token-Derived Info Decoding for precise geometric regression, and Geometry Projection-Based 3D Reasoning for handling camera focal length variations.
arXiv Detail & Related papers (2024-08-14T10:00:16Z)
- When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models [113.18524940863841]
This survey provides a comprehensive overview of the methodologies enabling large language models to process, understand, and generate 3D data.
Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs).
It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue.
arXiv Detail & Related papers (2024-05-16T16:59:58Z)
- OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning [68.45848423501927]
We propose a holistic framework for strong alignment between agent models and 3D driving tasks.
Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D.
We propose OmniDrive-nuScenes, a new visual question-answering dataset challenging the true 3D situational awareness of a model.
arXiv Detail & Related papers (2024-05-02T17:59:24Z)
- Visual Reinforcement Learning with Self-Supervised 3D Representations [15.991546692872841]
We present a unified framework for self-supervised learning of 3D representations for motor control.
Our method enjoys improved sample efficiency in simulated manipulation tasks compared to 2D representation learning methods.
arXiv Detail & Related papers (2022-10-13T17:59:55Z)
- Learnable Online Graph Representations for 3D Multi-Object Tracking [156.58876381318402]
We propose a unified, learning-based approach to the 3D MOT problem.
We employ a Neural Message Passing network for data association that is fully trainable.
We show the merit of the proposed approach on the publicly available nuScenes dataset by achieving state-of-the-art performance of 65.6% AMOTA and 58% fewer ID-switches.
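To make "fully trainable data association" concrete, here is a minimal, hypothetical sketch of one message-passing step on a tracking graph (detections as nodes, candidate track-detection links as edges); feature sizes and the sum aggregation are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class AssociationMessagePassing(nn.Module):
    """Hypothetical sketch: one message-passing step that scores candidate
    track-detection edges for 3D multi-object tracking."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.edge_classifier = nn.Linear(dim, 1)      # logit: same object or not

    def forward(self, node_feats: torch.Tensor, edge_feats: torch.Tensor,
                edge_index: torch.Tensor):
        src, dst = edge_index                          # (E,), (E,) endpoint node indices
        # Update each edge from the features of its two endpoints.
        edge_feats = self.edge_mlp(
            torch.cat([node_feats[src], node_feats[dst], edge_feats], dim=-1))
        # Aggregate incoming edge messages into target nodes (sum aggregation).
        agg = torch.zeros_like(node_feats).index_add_(0, dst, edge_feats)
        node_feats = self.node_mlp(torch.cat([node_feats, agg], dim=-1))
        # Return updated node features and a per-edge association logit.
        return node_feats, self.edge_classifier(edge_feats).squeeze(-1)
```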
arXiv Detail & Related papers (2021-04-23T17:59:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.