BEV-LLM: Leveraging Multimodal BEV Maps for Scene Captioning in Autonomous Driving
- URL: http://arxiv.org/abs/2507.19370v1
- Date: Fri, 25 Jul 2025 15:22:56 GMT
- Title: BEV-LLM: Leveraging Multimodal BEV Maps for Scene Captioning in Autonomous Driving
- Authors: Felix Brandstaetter, Erik Schuetz, Katharina Winter, Fabian Flohr,
- Abstract summary: We introduce BEV-LLM, a lightweight model for 3D captioning of autonomous driving scenes. Despite using a small 1B-parameter base model, BEV-LLM achieves competitive performance on the nuCaption dataset. We release two new datasets to better assess scene captioning across diverse driving scenarios.
- Score: 3.061835990893183
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous driving technology has the potential to transform transportation, but its wide adoption depends on the development of interpretable and transparent decision-making systems. Scene captioning, which generates natural language descriptions of the driving environment, plays a crucial role in enhancing transparency, safety, and human-AI interaction. We introduce BEV-LLM, a lightweight model for 3D captioning of autonomous driving scenes. BEV-LLM leverages BEVFusion to combine 3D LiDAR point clouds and multi-view images, incorporating a novel absolute positional encoding for view-specific scene descriptions. Despite using a small 1B-parameter base model, BEV-LLM achieves competitive performance on the nuCaption dataset, surpassing the state of the art by up to 5% in BLEU scores. Additionally, we release two new datasets - nuView (focused on environmental conditions and viewpoints) and GroundView (focused on object grounding) - to better assess scene captioning across diverse driving scenarios and address gaps in current benchmarks, along with initial benchmarking results demonstrating their effectiveness.
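As a rough illustration of the pipeline the abstract describes, the sketch below wires BEVFusion-style BEV features and an absolute positional encoding of the queried view into prefix tokens for a small language model. All module names, dimensions, and the sinusoidal form of the encoding are assumptions for illustration, not BEV-LLM's actual implementation.

```python
# Minimal sketch of a BEV-to-caption prefix pipeline (hypothetical names;
# the paper's exact encoding and fusion details may differ).
import math
import torch
import torch.nn as nn

class ViewPositionalEncoding(nn.Module):
    """Absolute positional encoding conditioned on the queried view angle."""
    def __init__(self, dim: int):
        super().__init__()
        self.dim = dim

    def forward(self, view_angle_rad: torch.Tensor) -> torch.Tensor:
        # Standard sinusoidal encoding of the absolute view angle (assumption).
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
        args = view_angle_rad[:, None] * freqs[None, :]
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (B, dim)

class BEVCaptionPrefix(nn.Module):
    """Projects fused BEV features into prefix tokens for a small LLM."""
    def __init__(self, bev_channels=256, llm_dim=2048):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d((4, 8))      # 32 spatial cells -> 32 tokens
        self.proj = nn.Linear(bev_channels, llm_dim)
        self.view_pe = ViewPositionalEncoding(llm_dim)

    def forward(self, bev_feat: torch.Tensor, view_angle_rad: torch.Tensor):
        # bev_feat: (B, C, H, W) from a BEVFusion-style backbone.
        x = self.pool(bev_feat).flatten(2).transpose(1, 2)   # (B, 32, C)
        tokens = self.proj(x)                                 # (B, 32, llm_dim)
        tokens = tokens + self.view_pe(view_angle_rad)[:, None, :]
        return tokens  # prepend to the text embeddings of the 1B-parameter LLM

prefix = BEVCaptionPrefix()(torch.randn(2, 256, 180, 180), torch.tensor([0.0, math.pi / 2]))
print(prefix.shape)  # torch.Size([2, 32, 2048])
```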
Related papers
- NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models [11.184459657989914]
We introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding. We also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs. Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives.
arXiv Detail & Related papers (2025-03-17T03:12:39Z)
- MTA: Multimodal Task Alignment for BEV Perception and Captioning [13.25655273023121]
Bird's eye view (BEV)-based 3D perception plays a crucial role in autonomous driving applications. Existing approaches treat perception and captioning as separate tasks, focusing on the performance of only one task. We introduce MTA, a novel multimodal task alignment framework that boosts both BEV perception and captioning.
arXiv Detail & Related papers (2024-11-16T00:14:13Z)
- Navigation Instruction Generation with BEV Perception and Large Language Models [60.455964599187205]
We propose BEVInstructor, which incorporates Bird's Eye View (BEV) features into Multi-Modal Large Language Models (MLLMs) for instruction generation.
Specifically, BEVInstructor constructs a PerspectiveBEV representation for comprehending 3D environments by fusing BEV and perspective features.
Based on the perspective-BEV prompts, BEVInstructor further adopts an instance-guided iterative refinement pipeline, which improves the instructions in a progressive manner.
arXiv Detail & Related papers (2024-07-21T08:05:29Z)
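The BEVInstructor summary above describes fusing BEV and perspective features into prompts for an MLLM and then refining the generated instruction iteratively. Below is a minimal sketch of that idea; the cross-attention fusion, the refinement loop, and all names and dimensions are hypothetical rather than BEVInstructor's actual modules.

```python
# Hypothetical sketch: fuse BEV and perspective tokens into prompt embeddings,
# then regenerate the instruction over several refinement passes.
import torch
import torch.nn as nn

class PerspectiveBEVFusion(nn.Module):
    def __init__(self, bev_dim=256, persp_dim=768, prompt_dim=1024, num_tokens=16):
        super().__init__()
        self.bev_proj = nn.Linear(bev_dim, prompt_dim)
        self.persp_proj = nn.Linear(persp_dim, prompt_dim)
        self.query = nn.Parameter(torch.randn(num_tokens, prompt_dim))
        self.attn = nn.MultiheadAttention(prompt_dim, num_heads=8, batch_first=True)

    def forward(self, bev_tokens, persp_tokens):
        # bev_tokens: (B, Nb, bev_dim); persp_tokens: (B, Np, persp_dim)
        ctx = torch.cat([self.bev_proj(bev_tokens), self.persp_proj(persp_tokens)], dim=1)
        q = self.query.unsqueeze(0).expand(ctx.size(0), -1, -1)
        prompts, _ = self.attn(q, ctx, ctx)   # cross-attend learned queries to fused context
        return prompts                        # (B, num_tokens, prompt_dim)

def iterative_refine(generate_fn, prompts, rounds=3):
    """Regenerate the instruction several times, feeding back the previous draft.
    `generate_fn` stands in for the MLLM's decoding call (an assumption here)."""
    draft = ""
    for _ in range(rounds):
        draft = generate_fn(prompts, previous=draft)
    return draft

fusion = PerspectiveBEVFusion()
prompts = fusion(torch.randn(1, 200, 256), torch.randn(1, 6 * 196, 768))
print(prompts.shape)  # torch.Size([1, 16, 1024])
```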
- OE-BevSeg: An Object Informed and Environment Aware Multimodal Framework for Bird's-eye-view Vehicle Semantic Segmentation [57.2213693781672]
Bird's-eye-view (BEV) semantic segmentation is becoming crucial in autonomous driving systems.
We propose OE-BevSeg, an end-to-end multimodal framework that enhances BEV segmentation performance.
Our approach achieves state-of-the-art results by a large margin on the nuScenes dataset for vehicle segmentation.
arXiv Detail & Related papers (2024-07-18T03:48:22Z)
- BEVWorld: A Multimodal World Simulator for Autonomous Driving via Scene-Level BEV Latents [56.33989853438012]
We propose BEVWorld, a framework that transforms multimodal sensor inputs into a unified and compact Bird's Eye View latent space for holistic environment modeling. The proposed world model consists of two main components: a multi-modal tokenizer and a latent BEV sequence diffusion model.
arXiv Detail & Related papers (2024-07-08T07:26:08Z)
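The BEVWorld entry names two components: a multi-modal tokenizer and a latent BEV sequence diffusion model. Below is a toy sketch of one training step for the second component under standard DDPM-style assumptions; the denoiser, noise schedule, latent shapes, and the omission of timestep conditioning are simplifications, not BEVWorld's actual design.

```python
# Toy sketch of one denoising-diffusion training step over a sequence of BEV
# latents (shapes, schedule, and denoiser are illustrative assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(                      # stand-in for a spatio-temporal denoiser
    nn.Conv3d(32, 64, 3, padding=1), nn.SiLU(), nn.Conv3d(64, 32, 3, padding=1)
)

def diffusion_training_step(bev_latents):
    # bev_latents: (B, C, T_seq, H, W) clean latents from the multi-modal tokenizer.
    b = bev_latents.size(0)
    t = torch.randint(0, T, (b,))
    ab = alpha_bars[t].view(b, 1, 1, 1, 1)
    noise = torch.randn_like(bev_latents)
    noisy = ab.sqrt() * bev_latents + (1 - ab).sqrt() * noise
    pred = denoiser(noisy)                     # predict the added noise (no timestep conditioning here)
    return F.mse_loss(pred, noise)

loss = diffusion_training_step(torch.randn(2, 32, 4, 50, 50))
print(float(loss))
```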
- Talk2BEV: Language-enhanced Bird's-eye View Maps for Autonomous Driving [23.957306230979746]
Talk2BEV is a vision-language model interface for bird's-eye view (BEV) maps in autonomous driving contexts.
It blends recent advances in general-purpose language and vision models with BEV-structured map representations.
We extensively evaluate Talk2BEV on a large number of scene understanding tasks.
arXiv Detail & Related papers (2023-10-03T17:53:51Z)
- Bird's-Eye-View Scene Graph for Vision-Language Navigation [85.72725920024578]
Vision-language navigation (VLN) requires an agent to navigate 3D environments by following human instructions.
We present a BEV Scene Graph (BSG), which leverages multi-step BEV representations to encode scene layouts and geometric cues of indoor environments.
Based on BSG, the agent predicts a local BEV grid-level decision score and a global graph-level decision score, combined with a sub-view selection score on panoramic views.
arXiv Detail & Related papers (2023-08-09T07:48:20Z)
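The BSG entry above combines a local BEV grid-level decision score, a global graph-level decision score, and a sub-view selection score into one navigation decision. The sketch below shows one plausible way to fuse such scores; the weighted-sum formulation and the weights are assumptions, not the paper's actual fusion rule.

```python
# Hypothetical fusion of BSG-style decision scores over candidate waypoints.
import torch

def fuse_navigation_scores(grid_scores, graph_scores, subview_scores,
                           w_grid=1.0, w_graph=1.0, w_view=1.0):
    """Combine local BEV grid-level, global graph-level, and sub-view selection
    scores (each shaped (num_candidates,)) into one decision distribution.
    The weighted-sum fusion and weights are illustrative assumptions."""
    logits = w_grid * grid_scores + w_graph * graph_scores + w_view * subview_scores
    probs = torch.softmax(logits, dim=-1)
    return probs, int(torch.argmax(probs))

probs, choice = fuse_navigation_scores(torch.tensor([0.2, 1.5, 0.3]),
                                        torch.tensor([0.1, 0.9, 1.1]),
                                        torch.tensor([0.4, 0.8, 0.2]))
print(choice)  # index of the selected candidate view/waypoint
```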
- SA-BEV: Generating Semantic-Aware Bird's-Eye-View Feature for Multi-view 3D Object Detection [46.92706423094971]
We propose Semantic-Aware BEV Pooling (SA-BEVPool), which can filter out background information according to the semantic segmentation of image features.
We also propose BEV-Paste, an effective data augmentation strategy that closely matches the semantic-aware BEV features.
Experiments on nuScenes show that SA-BEV achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-21T10:28:19Z)
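SA-BEVPool, as summarized above, filters out background information according to the semantic segmentation of image features before lifting them into the BEV grid. A minimal sketch of that masking idea follows; the segmentation head, the background-suppression weighting, and all shapes are illustrative assumptions rather than the paper's implementation.

```python
# Hypothetical sketch of semantic-aware pooling: suppress background pixels
# (via a segmentation head) before projecting image features into BEV space.
import torch
import torch.nn as nn

class SemanticAwarePooling(nn.Module):
    def __init__(self, channels=256, num_classes=2):
        super().__init__()
        self.seg_head = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, img_feat):
        # img_feat: (B, C, H, W) camera feature map.
        seg = self.seg_head(img_feat).softmax(dim=1)   # per-pixel class probabilities
        foreground = 1.0 - seg[:, 0:1]                 # assume class 0 is background
        return img_feat * foreground                   # background features pushed towards zero

feat = torch.randn(2, 256, 32, 88)
filtered = SemanticAwarePooling()(feat)
print(filtered.shape)  # torch.Size([2, 256, 32, 88])
```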
- Street-View Image Generation from a Bird's-Eye View Layout [95.36869800896335]
Bird's-Eye View (BEV) perception has received increasing attention in recent years.
Data-driven simulation for autonomous driving has been a focal point of recent research.
We propose BEVGen, a conditional generative model that synthesizes realistic and spatially consistent surrounding images.
arXiv Detail & Related papers (2023-01-11T18:39:34Z)
- BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z)
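BEVerse, per the summary above, performs 3D object detection, semantic map construction, and motion prediction from shared multi-camera BEV features. The sketch below shows the general shared-backbone, multi-head pattern; the head designs and shapes are assumptions for illustration, not BEVerse's actual modules.

```python
# Illustrative sketch of a shared-BEV multi-task setup: one BEV feature map
# feeds detection, map, and motion heads (head designs are assumptions).
import torch
import torch.nn as nn

class MultiTaskBEVHeads(nn.Module):
    def __init__(self, bev_channels=256, num_det_classes=10, num_map_classes=4):
        super().__init__()
        self.det_head = nn.Conv2d(bev_channels, num_det_classes, 1)   # per-cell object class logits
        self.map_head = nn.Conv2d(bev_channels, num_map_classes, 1)   # semantic map logits
        self.motion_head = nn.Conv2d(bev_channels, 2, 1)              # per-cell future flow (dx, dy)

    def forward(self, bev_feat):
        # bev_feat: (B, C, H, W) shared BEV features from the multi-camera backbone.
        return {
            "detection": self.det_head(bev_feat),
            "semantic_map": self.map_head(bev_feat),
            "motion": self.motion_head(bev_feat),
        }

outputs = MultiTaskBEVHeads()(torch.randn(1, 256, 200, 200))
print({k: tuple(v.shape) for k, v in outputs.items()})
```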