ChatBEV: A Visual Language Model that Understands BEV Maps
- URL: http://arxiv.org/abs/2503.13938v2
- Date: Fri, 21 Mar 2025 02:17:52 GMT
- Title: ChatBEV: A Visual Language Model that Understands BEV Maps
- Authors: Qingyao Xu, Siheng Chen, Guang Chen, Yanfeng Wang, Ya Zhang
- Abstract summary: We introduce ChatBEV-QA, a novel BEV VQA benchmark containing over 137k questions. This benchmark is constructed using a novel data collection pipeline that generates scalable and informative VQA data for BEV maps. We propose a language-driven traffic scene generation pipeline, where ChatBEV facilitates map understanding and text-aligned navigation guidance.
- Score: 58.3005092762598
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traffic scene understanding is essential for intelligent transportation systems and autonomous driving, ensuring safe and efficient vehicle operation. While recent advancements in VLMs have shown promise for holistic scene understanding, the application of VLMs to traffic scenarios, particularly using BEV maps, remains underexplored. Existing methods often suffer from limited task design and small-scale data, hindering comprehensive scene understanding. To address these challenges, we introduce ChatBEV-QA, a novel BEV VQA benchmark containing over 137k questions, designed to encompass a wide range of scene understanding tasks, including global scene understanding, vehicle-lane interactions, and vehicle-vehicle interactions. The benchmark is constructed using a novel data collection pipeline that generates scalable and informative VQA data for BEV maps. We further fine-tune a specialized vision-language model, ChatBEV, enabling it to interpret diverse question prompts and extract relevant context-aware information from BEV maps. Additionally, we propose a language-driven traffic scene generation pipeline, where ChatBEV facilitates map understanding and text-aligned navigation guidance, significantly enhancing the generation of realistic and consistent traffic scenarios. The dataset, code, and fine-tuned model will be released.
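The abstract describes ChatBEV-QA samples spanning three task families (global scene understanding, vehicle-lane interactions, and vehicle-vehicle interactions) and a VLM that answers question prompts over BEV maps. As a rough illustration only, the Python sketch below shows how such a sample might be structured and queried; the field names, the PIL-based image loading, and the `generate(images, prompt)` interface are assumptions, not the paper's actual schema or API.

```python
from dataclasses import dataclass
from PIL import Image

@dataclass
class BEVVQASample:
    """Hypothetical ChatBEV-QA style sample grounded on a rasterized BEV map."""
    scene_id: str      # identifier of the source traffic scene
    task_type: str     # "global", "vehicle-lane", or "vehicle-vehicle"
    question: str      # natural-language question about the scene
    answer: str        # ground-truth answer string
    bev_map_path: str  # path to the rendered BEV map image

def ask_chatbev(model, sample: BEVVQASample) -> str:
    """Query a ChatBEV-style VLM with the BEV map plus a question prompt.

    `model` is assumed to expose a generate(images, prompt) method; this
    interface is illustrative and not taken from the released code.
    """
    bev_image = Image.open(sample.bev_map_path).convert("RGB")
    prompt = f"Question: {sample.question}\nAnswer:"
    return model.generate(images=[bev_image], prompt=prompt)
```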
Related papers
- NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models [11.184459657989914]
We introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding. We also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs. Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives.
arXiv Detail & Related papers (2025-03-17T03:12:39Z) - SimBEV: A Synthetic Multi-Task Multi-Sensor Driving Data Generation Tool and Dataset [101.51012770913627]
Bird's-eye view (BEV) perception for autonomous driving has garnered significant attention in recent years. We introduce SimBEV, a synthetic data generation tool that incorporates information from multiple sources to capture accurate BEV ground truth data. We use SimBEV to create the SimBEV dataset, a large collection of annotated perception data from diverse driving scenarios.
arXiv Detail & Related papers (2025-02-04T00:00:06Z) - VQ-Map: Bird's-Eye-View Map Layout Estimation in Tokenized Discrete Space via Vector Quantization [108.68014173017583]
Bird's-eye-view (BEV) map layout estimation requires an accurate and full understanding of the semantics for the environmental elements around the ego car.
We propose to utilize a generative model similar to the Vector Quantized-Variational AutoEncoder (VQ-VAE) to acquire prior knowledge for the high-level BEV semantics in the tokenized discrete space.
Thanks to the obtained BEV tokens, accompanied by a codebook embedding that encapsulates the semantics of the different BEV elements in the ground-truth maps, the sparse backbone image features can be directly aligned with these BEV tokens (a minimal sketch of the quantization step appears after this list).
arXiv Detail & Related papers (2024-11-03T16:09:47Z) - CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving [1.727597257312416]
The CoVLA (Comprehensive Vision-Language-Action) dataset comprises real-world driving videos spanning more than 80 hours. This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems.
arXiv Detail & Related papers (2024-08-19T09:53:49Z) - Navigation Instruction Generation with BEV Perception and Large Language Models [60.455964599187205]
We propose BEVInstructor, which incorporates Bird's Eye View (BEV) features into Multi-Modal Large Language Models (MLLMs) for instruction generation.
Specifically, BEVInstructor constructs a PerspectiveBEV representation for the comprehension of 3D environments by fusing BEV and perspective features.
Based on the perspective-BEV prompts, BEVInstructor further adopts an instance-guided iterative refinement pipeline, which improves the instructions in a progressive manner.
arXiv Detail & Related papers (2024-07-21T08:05:29Z) - Talk2BEV: Language-enhanced Bird's-eye View Maps for Autonomous Driving [23.957306230979746]
Talk2BEV is a vision-language model interface for bird's-eye view (BEV) maps in autonomous driving contexts.
It blends recent advances in general-purpose language and vision models with BEV-structured map representations.
We extensively evaluate Talk2BEV on a large number of scene understanding tasks.
arXiv Detail & Related papers (2023-10-03T17:53:51Z) - Street-View Image Generation from a Bird's-Eye View Layout [95.36869800896335]
Bird's-Eye View (BEV) Perception has received increasing attention in recent years.
Data-driven simulation for autonomous driving has been a focal point of recent research.
We propose BEVGen, a conditional generative model that synthesizes realistic and spatially consistent surrounding images.
arXiv Detail & Related papers (2023-01-11T18:39:34Z) - Structured Bird's-Eye-View Traffic Scene Understanding from Onboard Images [128.881857704338]
We study the problem of extracting a directed graph representing the local road network in BEV coordinates, from a single onboard camera image.
We show that the method can be extended to detect dynamic objects on the BEV plane.
We validate our approach against powerful baselines and show that our network achieves superior performance.
arXiv Detail & Related papers (2021-10-05T12:40:33Z)
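The VQ-Map entry above tokenizes BEV semantics with a VQ-VAE style codebook. Purely as an illustration of that general quantization step, and not code from that paper, here is a minimal NumPy sketch that maps continuous BEV feature vectors to discrete token indices by nearest-neighbour lookup in a codebook:

```python
import numpy as np

def quantize_bev_features(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Nearest-neighbour vector quantization (VQ-VAE style).

    features: (N, D) continuous BEV feature vectors.
    codebook: (K, D) learned codebook embeddings.
    Returns the (N,) index of the closest codebook entry for each feature,
    i.e. the discrete BEV tokens.
    """
    # Squared Euclidean distance between every feature and every code: (N, K).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy usage: 4 features, a codebook of 8 entries, 16-dimensional embeddings.
rng = np.random.default_rng(0)
tokens = quantize_bev_features(rng.normal(size=(4, 16)), rng.normal(size=(8, 16)))
```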
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all listed content) and is not responsible for any consequences arising from its use.