Talk2BEV: Language-enhanced Bird's-eye View Maps for Autonomous Driving
- URL: http://arxiv.org/abs/2310.02251v2
- Date: Tue, 14 Nov 2023 14:46:05 GMT
- Title: Talk2BEV: Language-enhanced Bird's-eye View Maps for Autonomous Driving
- Authors: Tushar Choudhary, Vikrant Dewangan, Shivam Chandhok, Shubham
Priyadarshan, Anushka Jain, Arun K. Singh, Siddharth Srivastava, Krishna
Murthy Jatavallabhula, K. Madhava Krishna
- Abstract summary: Talk2BEV is a vision-language model interface for bird's-eye view (BEV) maps in autonomous driving contexts.
It blends recent advances in general-purpose language and vision models with BEV-structured map representations.
We extensively evaluate Talk2BEV on a large number of scene understanding tasks.
- Score: 23.957306230979746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Talk2BEV is a large vision-language model (LVLM) interface for bird's-eye
view (BEV) maps in autonomous driving contexts. While existing perception
systems for autonomous driving scenarios have largely focused on a pre-defined
(closed) set of object categories and driving scenarios, Talk2BEV blends recent
advances in general-purpose language and vision models with BEV-structured map
representations, eliminating the need for task-specific models. This enables a
single system to cater to a variety of autonomous driving tasks encompassing
visual and spatial reasoning, predicting the intents of traffic actors, and
decision-making based on visual cues. We extensively evaluate Talk2BEV on a
large number of scene understanding tasks that rely both on the ability to
interpret free-form natural language queries and on grounding these queries in
the visual context embedded in the language-enhanced BEV map. To enable
further research in LVLMs for autonomous driving scenarios, we develop and
release Talk2BEV-Bench, a benchmark encompassing 1000 human-annotated BEV
scenarios, with more than 20,000 questions and ground-truth responses from the
NuScenes dataset.
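As a rough illustration of the idea (not the authors' released code), a language-enhanced BEV map can be thought of as a list of object entries, each pairing BEV geometry with an LVLM-generated description, over which a general-purpose LLM then reasons in response to free-form queries. The object fields, the serialization format, and the prompt wording below are illustrative assumptions.

```python
# Illustrative sketch only: schema and helper names are assumptions,
# not the released Talk2BEV implementation.
import json
from dataclasses import dataclass, asdict
from typing import Callable, List

@dataclass
class BEVObject:
    object_id: int
    bev_centroid: tuple   # (x, y) in ego-centric BEV coordinates (metres)
    bev_area: float       # object footprint in the BEV grid (m^2)
    caption: str          # free-form description produced by an LVLM from the image crop

def build_language_enhanced_bev(objects: List[BEVObject]) -> str:
    """Serialize object-level metadata into a JSON 'map' an LLM can read."""
    return json.dumps([asdict(o) for o in objects], indent=2)

def answer_query(bev_json: str, query: str, llm: Callable[[str], str]) -> str:
    """Ground a free-form query in the language-enhanced BEV map via a generic LLM callable."""
    prompt = (
        "You are given a bird's-eye-view scene description as JSON.\n"
        f"{bev_json}\n"
        f"Question: {query}\nAnswer concisely."
    )
    return llm(prompt)

# Example usage with a stand-in 'LLM' that only reports the prompt length.
scene = [
    BEVObject(1, (12.4, -3.1), 9.8, "white delivery truck parked on the right shoulder"),
    BEVObject(2, (5.0, 1.2), 0.7, "pedestrian crossing from left to right"),
]
print(answer_query(build_language_enhanced_bev(scene),
                   "Which object is closest to the ego vehicle?",
                   llm=lambda p: f"(prompt of {len(p)} chars sent to an LVLM/LLM)"))
```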
Related papers
- ChatBEV: A Visual Language Model that Understands BEV Maps [58.3005092762598]
We introduce ChatBEV-QA, a novel BEV VQA benchmark containing over 137k questions.
This benchmark is constructed using a novel data collection pipeline that generates scalable and informative VQA data for BEV maps.
We propose a language-driven traffic scene generation pipeline, where ChatBEV facilitates map understanding and text-aligned navigation guidance.
arXiv Detail & Related papers (2025-03-18T06:12:38Z)
- NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models [11.184459657989914]
We introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding.
We also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs.
Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives.
arXiv Detail & Related papers (2025-03-17T03:12:39Z)
- BEVDriver: Leveraging BEV Maps in LLMs for Robust Closed-Loop Driving [0.5735035463793009]
Large Language Models (LLMs) possess reasoning capabilities and natural language understanding.
We introduce BEVDriver, an LLM-based model for end-to-end closed-loop driving in CARLA.
On the LangAuto benchmark, our model reaches up to 18.9% higher performance on the Driving Score compared to SoTA methods.
arXiv Detail & Related papers (2025-03-05T00:27:32Z)
- SimBEV: A Synthetic Multi-Task Multi-Sensor Driving Data Generation Tool and Dataset [101.51012770913627]
Bird's-eye view (BEV) perception has garnered significant attention in autonomous driving in recent years.
SimBEV is a randomized synthetic data generation tool that is extensively configurable and scalable.
SimBEV is used to create the SimBEV dataset, a large collection of annotated perception data from diverse driving scenarios.
arXiv Detail & Related papers (2025-02-04T00:00:06Z)
- MTA: Multimodal Task Alignment for BEV Perception and Captioning [13.82751518921778]
Bird's eye view (BEV)-based 3D perception plays a crucial role in autonomous driving applications.
Existing approaches treat perception and captioning as separate tasks, focusing on the performance of only one of the tasks.
We introduce MTA, a novel multimodal task alignment framework that boosts both BEV perception and captioning.
arXiv Detail & Related papers (2024-11-16T00:14:13Z)
- VQ-Map: Bird's-Eye-View Map Layout Estimation in Tokenized Discrete Space via Vector Quantization [108.68014173017583]
Bird's-eye-view (BEV) map layout estimation requires an accurate and full understanding of the semantics for the environmental elements around the ego car.
We propose to utilize a generative model similar to the Vector Quantized-Variational AutoEncoder (VQ-VAE) to acquire prior knowledge for the high-level BEV semantics in the tokenized discrete space.
Thanks to the obtained BEV tokens, accompanied by a codebook embedding that encapsulates the semantics of different BEV elements in the ground-truth maps, we are able to directly align the sparse backbone image features with the BEV tokens (a minimal tokenization sketch follows this list).
arXiv Detail & Related papers (2024-11-03T16:09:47Z)
- CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving [1.727597257312416]
The CoVLA (Comprehensive Vision-Language-Action) dataset comprises real-world driving videos spanning more than 80 hours.
This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems.
arXiv Detail & Related papers (2024-08-19T09:53:49Z)
- SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving [15.551625571158056]
We propose an e2eAD method called SimpleLLM4AD.
In our method, the e2eAD task is divided into four stages: perception, prediction, planning, and behavior.
Our experiments demonstrate that SimpleLLM4AD achieves competitive performance in complex driving scenarios.
arXiv Detail & Related papers (2024-07-31T02:35:33Z)
- Navigation Instruction Generation with BEV Perception and Large Language Models [60.455964599187205]
We propose BEVInstructor, which incorporates Bird's Eye View (BEV) features into Multi-Modal Large Language Models (MLLMs) for instruction generation.
Specifically, BEVInstructor constructs a Perspective-BEV visual encoder for the comprehension of 3D environments by fusing BEV and perspective features.
Based on the perspective-BEV prompts, BEVInstructor further adopts an instance-guided iterative refinement pipeline, which improves the instructions in a progressive manner.
arXiv Detail & Related papers (2024-07-21T08:05:29Z)
- Map It Anywhere (MIA): Empowering Bird's Eye View Mapping using Large-scale Public Data [3.1968751101341173]
Top-down Bird's Eye View (BEV) maps are a popular representation for ground robot navigation due to their richness and flexibility for downstream tasks.
While recent methods have shown promise for predicting BEV maps from First-Person View (FPV) images, their generalizability is limited to the small regions captured by current autonomous vehicle-based datasets.
We show that a more scalable approach towards generalizable map prediction can be enabled by using two large-scale crowd-sourced mapping platforms.
arXiv Detail & Related papers (2024-07-11T17:57:22Z)
- BLOS-BEV: Navigation Map Enhanced Lane Segmentation Network, Beyond Line of Sight [30.45553559416835]
We propose BLOS-BEV, a novel BEV segmentation model that incorporates SD maps for accurate beyond line-of-sight perception, up to 200m.
Our approach is applicable to common BEV architectures and can achieve excellent results by incorporating information derived from SD maps.
arXiv Detail & Related papers (2024-07-11T14:15:48Z)
- BEVWorld: A Multimodal World Model for Autonomous Driving via Unified BEV Latent Space [57.68134574076005]
We present BEVWorld, a novel approach that tokenizes multimodal sensor inputs into a unified and compact Bird's Eye View latent space for environment modeling.
Experiments demonstrate the effectiveness of BEVWorld in autonomous driving tasks, showcasing its capability in generating future scenes and benefiting downstream tasks such as perception and motion prediction.
arXiv Detail & Related papers (2024-07-08T07:26:08Z)
- Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models [76.99140362751787]
We present NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks.
We also present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View features.
arXiv Detail & Related papers (2024-01-02T01:54:22Z)
- Street-View Image Generation from a Bird's-Eye View Layout [95.36869800896335]
Bird's-Eye View (BEV) Perception has received increasing attention in recent years.
Data-driven simulation for autonomous driving has been a focal point of recent research.
We propose BEVGen, a conditional generative model that synthesizes realistic and spatially consistent surrounding images.
arXiv Detail & Related papers (2023-01-11T18:39:34Z)
- Delving into the Devils of Bird's-eye-view Perception: A Review, Evaluation and Recipe [115.31507979199564]
Learning powerful representations in bird's-eye-view (BEV) for perception tasks is trending and drawing extensive attention both from industry and academia.
As sensor configurations get more complex, integrating multi-source information from different sensors and representing features in a unified view come of vital importance.
The core problems for BEV perception lie in (a) how to reconstruct the lost 3D information via view transformation from perspective view to BEV; (b) how to acquire ground-truth annotations in the BEV grid; and (c) how to adapt and generalize algorithms as sensor configurations vary across different scenarios.
arXiv Detail & Related papers (2022-09-12T15:29:13Z)
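Several of the entries above (VQ-Map, BEVWorld) revolve around mapping continuous BEV features onto a learned discrete codebook. The sketch below, referenced from the VQ-Map summary, shows only the generic nearest-neighbour lookup step; the array shapes, function name, and toy sizes are illustrative assumptions, not either paper's implementation.

```python
# Minimal vector-quantization sketch (assumed shapes and names; not VQ-Map's or BEVWorld's code).
import numpy as np

def quantize_bev_features(features: np.ndarray, codebook: np.ndarray):
    """Map continuous BEV features to discrete codebook tokens.

    features: (N, D) array of per-cell BEV feature vectors.
    codebook: (K, D) array of learned code embeddings.
    Returns (token_ids, quantized) where quantized[i] = codebook[token_ids[i]].
    """
    # Squared Euclidean distances via ||f||^2 - 2 f.c + ||c||^2, giving an (N, K) matrix.
    dists = (
        (features ** 2).sum(1, keepdims=True)
        - 2.0 * features @ codebook.T
        + (codebook ** 2).sum(1)
    )
    token_ids = dists.argmin(axis=1)   # index of the nearest code per BEV cell
    quantized = codebook[token_ids]    # discrete "BEV tokens" fed to downstream heads
    return token_ids, quantized

# Toy example: a 200x200 BEV grid flattened to N cells, 64-dim features, 512 codes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200 * 200, 64)).astype(np.float32)
codes = rng.normal(size=(512, 64)).astype(np.float32)
ids, quant = quantize_bev_features(feats, codes)
print(ids.shape, quant.shape)  # (40000,) (40000, 64)
```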
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented here and is not responsible for any consequences of its use.