BEV-VLM: Trajectory Planning via Unified BEV Abstraction
- URL: http://arxiv.org/abs/2509.25249v1
- Date: Sat, 27 Sep 2025 07:13:55 GMT
- Title: BEV-VLM: Trajectory Planning via Unified BEV Abstraction
- Authors: Guancheng Chen, Sheng Yang, Tong Zhan, Jian Wang
- Abstract summary: This paper introduces a novel framework for trajectory planning in autonomous driving that leverages Vision-Language Models (VLMs) with Bird's-Eye View (BEV) feature maps as visual inputs. Our method utilizes highly compressed and informative BEV representations, which are generated by fusing multi-modal sensor data (e.g., camera and LiDAR) and aligning them with HD Maps. Experimental results on the nuScenes dataset demonstrate a 44.8% improvement in planning accuracy and complete collision avoidance.
- Score: 6.603679803036061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces BEV-VLM, a novel framework for trajectory planning in autonomous driving that leverages Vision-Language Models (VLMs) with Bird's-Eye View (BEV) feature maps as visual inputs. Unlike conventional approaches that rely solely on raw visual data such as camera images, our method utilizes highly compressed and informative BEV representations, which are generated by fusing multi-modal sensor data (e.g., camera and LiDAR) and aligning them with HD Maps. This unified BEV-HD Map format provides a geometrically consistent and rich scene description, enabling VLMs to perform accurate trajectory planning. Experimental results on the nuScenes dataset demonstrate a 44.8% improvement in planning accuracy and complete collision avoidance. Our work highlights that VLMs can effectively interpret processed visual representations like BEV features, expanding their applicability beyond raw images in trajectory planning.
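As a rough illustration of the pipeline described in the abstract, the sketch below shows how fused camera/LiDAR BEV features and a rasterized HD map might be concatenated, projected into a vision-language backbone's token space, and decoded into future waypoints. All shapes, module names (`BEVVLMPlanner`, `visual_proj`, `traj_head`), and the small transformer stand-in are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of the BEV-VLM idea: BEV features + HD-map raster -> visual
# tokens -> VLM backbone -> trajectory waypoints. Hypothetical shapes/modules.
import torch
import torch.nn as nn

class BEVVLMPlanner(nn.Module):
    def __init__(self, bev_channels=256, map_channels=16, token_dim=512, horizon=6):
        super().__init__()
        self.horizon = horizon
        # Project the concatenated BEV + HD-map grid into token space.
        self.visual_proj = nn.Conv2d(bev_channels + map_channels, token_dim, kernel_size=1)
        # Pool the 200x200 grid down to a manageable number of visual tokens.
        self.pool = nn.AdaptiveAvgPool2d(16)
        # Stand-in for a pretrained VLM backbone.
        encoder_layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=8, batch_first=True)
        self.vlm = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Decode (x, y) waypoints over the planning horizon.
        self.traj_head = nn.Linear(token_dim, horizon * 2)

    def forward(self, bev_feat, hd_map_raster):
        # bev_feat:      (B, 256, 200, 200) fused camera+LiDAR BEV features
        # hd_map_raster: (B, 16, 200, 200)  rasterized HD-map layers
        x = torch.cat([bev_feat, hd_map_raster], dim=1)
        x = self.pool(self.visual_proj(x))                 # (B, D, 16, 16)
        tokens = x.flatten(2).transpose(1, 2)              # (B, 256, D)
        pooled = self.vlm(tokens).mean(dim=1)              # (B, D)
        return self.traj_head(pooled).view(-1, self.horizon, 2)

planner = BEVVLMPlanner()
waypoints = planner(torch.randn(1, 256, 200, 200), torch.randn(1, 16, 200, 200))
print(waypoints.shape)  # torch.Size([1, 6, 2])
```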
Related papers
- Bridging Perspectives: Foundation Model Guided BEV Maps for 3D Object Detection and Tracking [16.90910171943142]
Camera-based 3D object detection and tracking are essential for perception in autonomous driving. Current state-of-the-art approaches often rely exclusively on either perspective-view (PV) or bird's-eye-view (BEV) features. We propose DualViewDistill, a hybrid detection and tracking framework that incorporates both PV and BEV camera image features.
arXiv Detail & Related papers (2025-10-11T17:01:42Z) - ChatBEV: A Visual Language Model that Understands BEV Maps [58.3005092762598]
We introduce ChatBEV-QA, a novel BEV VQA benchmark containing over 137k questions. This benchmark is constructed using a novel data collection pipeline that generates scalable and informative VQA data for BEV maps. We propose a language-driven traffic scene generation pipeline, where ChatBEV facilitates map understanding and text-aligned navigation guidance.
arXiv Detail & Related papers (2025-03-18T06:12:38Z) - SimBEV: A Synthetic Multi-Task Multi-Sensor Driving Data Generation Tool and Dataset [101.51012770913627]
Bird's-eye view (BEV) perception has garnered significant attention in autonomous driving in recent years. SimBEV is an extensively configurable and scalable randomized synthetic data generation tool. SimBEV is used to create the SimBEV dataset, a large collection of annotated perception data from diverse driving scenarios.
arXiv Detail & Related papers (2025-02-04T00:00:06Z) - VQ-Map: Bird's-Eye-View Map Layout Estimation in Tokenized Discrete Space via Vector Quantization [108.68014173017583]
Bird's-eye-view (BEV) map layout estimation requires an accurate and full understanding of the semantics for the environmental elements around the ego car.
We propose to utilize a generative model similar to the Vector Quantized-Variational AutoEncoder (VQ-VAE) to acquire prior knowledge for the high-level BEV semantics in the tokenized discrete space.
Thanks to the obtained BEV tokens, accompanied by a codebook embedding encapsulating the semantics of the different BEV elements in the ground-truth maps, we are able to directly align the sparse backbone image features with the obtained BEV tokens.
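A rough, hypothetical sketch of the vector-quantization step summarized above: BEV cell features are snapped to their nearest codebook entries, and the resulting discrete tokens serve as targets that the image backbone is trained to predict. The codebook size, feature dimensions, and cross-entropy alignment loss are assumptions for illustration only.

```python
# Illustrative VQ step for BEV semantics: quantize cell features to codebook
# tokens, then align image-backbone predictions to those tokens.
import torch
import torch.nn.functional as F

def quantize(bev_feat, codebook):
    # bev_feat: (N, D) flattened BEV cell features
    # codebook: (K, D) learned embedding vectors
    dists = torch.cdist(bev_feat, codebook)   # (N, K) pairwise distances
    tokens = dists.argmin(dim=1)              # nearest codebook index per cell
    quantized = codebook[tokens]              # (N, D) snapped features
    return tokens, quantized

def alignment_loss(img_feat_logits, tokens):
    # img_feat_logits: (N, K) per-cell predictions from the image backbone
    return F.cross_entropy(img_feat_logits, tokens)

codebook = torch.randn(512, 64)
tokens, quantized = quantize(torch.randn(1000, 64), codebook)
loss = alignment_loss(torch.randn(1000, 512), tokens)
```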
arXiv Detail & Related papers (2024-11-03T16:09:47Z) - Map It Anywhere (MIA): Empowering Bird's Eye View Mapping using Large-scale Public Data [3.1968751101341173]
Top-down Bird's Eye View (BEV) maps are a popular representation for ground robot navigation. While recent methods have shown promise for predicting BEV maps from First-Person View (FPV) images, their generalizability is limited to small regions captured by current autonomous vehicle-based datasets. We show that a more scalable approach towards generalizable map prediction can be enabled by using two large-scale crowd-sourced mapping platforms.
arXiv Detail & Related papers (2024-07-11T17:57:22Z) - Bird's-Eye-View Scene Graph for Vision-Language Navigation [85.72725920024578]
Vision-language navigation (VLN) requires an agent to navigate 3D environments by following human instructions.
We present a BEV Scene Graph (BSG), which leverages multi-step BEV representations to encode scene layouts and geometric cues of indoor environment.
Based on BSG, the agent predicts a local BEV grid-level decision score and a global graph-level decision score, combined with a sub-view selection score on panoramic views.
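A minimal sketch of the score fusion described in the summary above: the local BEV grid-level score, the global graph-level score, and the sub-view selection score are combined into a single decision over candidate navigation directions. The weighting scheme and softmax fusion are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical fusion of the three decision scores over candidate actions.
import torch

def fuse_decision_scores(local_bev, global_graph, subview, w=(1.0, 1.0, 1.0)):
    # Each input: (num_candidates,) unnormalized scores for candidate actions.
    combined = w[0] * local_bev + w[1] * global_graph + w[2] * subview
    return torch.softmax(combined, dim=0)   # action probabilities

probs = fuse_decision_scores(torch.randn(8), torch.randn(8), torch.randn(8))
next_action = probs.argmax().item()
```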
arXiv Detail & Related papers (2023-08-09T07:48:20Z) - FB-BEV: BEV Representation from Forward-Backward View Transformations [131.11787050205697]
We propose a novel View Transformation Module (VTM) for Bird's-Eye-View (BEV) representation.
We instantiate the proposed module with FB-BEV, which achieves a new state-of-the-art result of 62.4% NDS on the nuScenes test set.
arXiv Detail & Related papers (2023-08-04T10:26:55Z) - Street-View Image Generation from a Bird's-Eye View Layout [95.36869800896335]
Bird's-Eye View (BEV) Perception has received increasing attention in recent years.
Data-driven simulation for autonomous driving has been a focal point of recent research.
We propose BEVGen, a conditional generative model that synthesizes realistic and spatially consistent surrounding images.
arXiv Detail & Related papers (2023-01-11T18:39:34Z) - BEV-MODNet: Monocular Camera based Bird's Eye View Moving Object Detection for Autonomous Driving [2.9769485817170387]
CNNs can leverage the global context in the scene to produce better BEV projections of moving objects.
We create an extended KITTI-raw dataset consisting of 12.9k images with annotations of moving object masks in BEV space for five classes.
We observe a significant improvement of 13% in mIoU using the simple baseline implementation.
arXiv Detail & Related papers (2021-07-11T01:11:58Z) - Multi-View Fusion of Sensor Data for Improved Perception and Prediction in Autonomous Driving [11.312620949473938]
We present an end-to-end method for object detection and trajectory prediction utilizing multi-view representations of LiDAR and camera images.
Our model builds on a state-of-the-art Bird's-Eye View (BEV) network that fuses voxelized features from a sequence of historical LiDAR data.
We extend this model with additional LiDAR Range-View (RV) features that use the raw LiDAR information in its native, non-quantized representation.
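A rough sketch of the multi-view fusion idea in the summary above: per-point range-view (RV) features are scattered into the BEV grid and concatenated with the voxelized BEV features before the downstream heads. Grid size, channel counts, and the scatter-mean fusion are illustrative assumptions rather than the paper's implementation.

```python
# Hypothetical projection of per-point range-view features into the BEV grid.
import torch

def project_rv_to_bev(points_xy, rv_feat, grid_size=200, cell=0.5):
    # points_xy: (N, 2) LiDAR point coordinates in metres (ego frame)
    # rv_feat:   (N, C) per-point features from the range-view branch
    ix = ((points_xy[:, 0] / cell) + grid_size // 2).long().clamp(0, grid_size - 1)
    iy = ((points_xy[:, 1] / cell) + grid_size // 2).long().clamp(0, grid_size - 1)
    flat = ix * grid_size + iy                                  # (N,) cell ids
    C = rv_feat.shape[1]
    bev = torch.zeros(grid_size * grid_size, C)
    count = torch.zeros(grid_size * grid_size, 1)
    bev.index_add_(0, flat, rv_feat)                            # sum features per cell
    count.index_add_(0, flat, torch.ones(len(flat), 1))
    bev = bev / count.clamp(min=1)                              # mean per occupied cell
    return bev.view(grid_size, grid_size, C).permute(2, 0, 1)   # (C, H, W)

rv_bev = project_rv_to_bev(torch.randn(5000, 2) * 40, torch.randn(5000, 16))
fused = torch.cat([torch.randn(128, 200, 200), rv_bev], dim=0)  # BEV + RV channels
```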
arXiv Detail & Related papers (2020-08-27T03:32:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.