Vehicle-to-Infrastructure Collaborative Spatial Perception via Multimodal Large Language Models
- URL: http://arxiv.org/abs/2509.03837v1
- Date: Thu, 04 Sep 2025 02:57:47 GMT
- Title: Vehicle-to-Infrastructure Collaborative Spatial Perception via Multimodal Large Language Models
- Authors: Kimia Ehsani, Walid Saad
- Abstract summary: A lightweight, plug-and-play bird's-eye view (BEV) injection connector is proposed to overcome the lack of three-dimensional spatial understanding in multimodal large language models (MLLMs). A co-simulation environment combining the CARLA simulator with MATLAB-based ray tracing is developed to generate RGB, LiDAR, GPS, and wireless signal data across varied sensing scenarios. Simulation results show that the proposed BEV injection framework consistently improves performance across all tasks.
- Score: 41.00138090010061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate prediction of communication link quality metrics is essential for vehicle-to-infrastructure (V2I) systems, enabling smooth handovers, efficient beam management, and reliable low-latency communication. The increasing availability of sensor data from modern vehicles motivates the use of multimodal large language models (MLLMs) because of their adaptability across tasks and reasoning capabilities. However, MLLMs inherently lack three-dimensional spatial understanding. To overcome this limitation, a lightweight, plug-and-play bird's-eye view (BEV) injection connector is proposed. In this framework, a BEV of the environment is constructed by collecting sensing data from neighboring vehicles. This BEV representation is then fused with the ego vehicle's input to provide spatial context for the large language model. To support realistic multimodal learning, a co-simulation environment combining CARLA simulator and MATLAB-based ray tracing is developed to generate RGB, LiDAR, GPS, and wireless signal data across varied scenarios. Instructions and ground-truth responses are programmatically extracted from the ray-tracing outputs. Extensive experiments are conducted across three V2I link prediction tasks: line-of-sight (LoS) versus non-line-of-sight (NLoS) classification, link availability, and blockage prediction. Simulation results show that the proposed BEV injection framework consistently improved performance across all tasks. The results indicate that, compared to an ego-only baseline, the proposed approach improves the macro-average of the accuracy metrics by up to 13.9%. The results also show that this performance gain increases by up to 32.7% under challenging rainy and nighttime conditions, confirming the robustness of the framework in adverse settings.
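To make the connector concrete, the following is a minimal sketch of one plausible realization of the BEV injection idea: a fused BEV feature grid is pooled into a handful of tokens, projected to the language model's embedding width, and prepended to the ego vehicle's input tokens. The class name, feature dimensions, and pooling strategy are illustrative assumptions, not the authors' exact design.

```python
# Hypothetical sketch of a plug-and-play BEV injection connector.
# Assumptions (not from the paper): the fused BEV grid is a (B, C, H, W) tensor
# built from neighboring vehicles' sensing data, and the MLLM consumes a
# sequence of d_model-dimensional token embeddings.
import torch
import torch.nn as nn


class BEVInjectionConnector(nn.Module):
    """Projects a fused BEV feature map into a small set of prefix tokens
    that are prepended to the ego vehicle's input token sequence."""

    def __init__(self, bev_channels: int = 64, d_model: int = 4096):
        super().__init__()
        # Compress the spatial grid into a fixed 4 x 4 = 16 BEV tokens.
        self.pool = nn.AdaptiveAvgPool2d((4, 4))
        self.proj = nn.Linear(bev_channels, d_model)  # map to the LLM embedding width

    def forward(self, bev: torch.Tensor, ego_tokens: torch.Tensor) -> torch.Tensor:
        # bev:        (B, C, H, W) fused bird's-eye-view features
        # ego_tokens: (B, T, d_model) embeddings of the ego vehicle's own input
        pooled = self.pool(bev)                          # (B, C, 4, 4)
        tokens = pooled.flatten(2).transpose(1, 2)       # (B, 16, C)
        bev_tokens = self.proj(tokens)                   # (B, 16, d_model)
        # "Injection": prepend the spatial-context tokens to the ego sequence,
        # leaving the backbone MLLM itself unchanged.
        return torch.cat([bev_tokens, ego_tokens], dim=1)  # (B, 16 + T, d_model)
```

Because the adapter adds only a pooling layer and a linear projection, the backbone MLLM can stay frozen, which is consistent with the abstract's description of the connector as lightweight and plug-and-play.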
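The abstract also states that instructions and ground-truth responses are extracted programmatically from the ray-tracing outputs. The sketch below illustrates how labels for the three V2I tasks could be derived from per-frame ray-tracing records; the field names and the link threshold are placeholders, since the paper's exact schema is not given here.

```python
# Illustrative label extraction for the three V2I link prediction tasks.
# All field names and thresholds are assumptions for this sketch, not values
# taken from the paper.
from dataclasses import dataclass


@dataclass
class RayTraceRecord:
    has_direct_path: bool      # a line-of-sight ray exists between vehicle and infrastructure
    path_gain_dbm: float       # strongest-path receive power from the ray tracer
    blocked_next_frame: bool   # whether the direct path disappears in the next frame


def make_labels(rec: RayTraceRecord, link_threshold_dbm: float = -90.0) -> dict:
    """Derive ground-truth answers for LoS/NLoS classification, link
    availability, and blockage prediction from one ray-tracing record."""
    return {
        "los_vs_nlos": "LoS" if rec.has_direct_path else "NLoS",
        "link_available": rec.path_gain_dbm >= link_threshold_dbm,
        "blockage_predicted": rec.blocked_next_frame,
    }


# Example: a frame where the direct path exists but is about to be blocked.
print(make_labels(RayTraceRecord(True, -82.5, True)))
# {'los_vs_nlos': 'LoS', 'link_available': True, 'blockage_predicted': True}
```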
Related papers
- VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction [0.0]
VLMFusionOcc3D is a robust multimodal framework for dense 3D semantic occupancy prediction in autonomous driving. We introduce Weather-Aware Adaptive Fusion, a dynamic gating mechanism that utilizes vehicle metadata and weather-conditioned prompts to re-weight sensor contributions. Our approach achieves significant improvements in challenging weather scenarios, offering a scalable and robust solution for complex urban navigation.
arXiv Detail & Related papers (2026-03-03T05:22:28Z) - MIDAR: Mimicking LiDAR Detection for Traffic Applications with a Lightweight Plug-and-Play Model [3.256565256248141]
MIDAR is a LiDAR detection mimicking model that approximates realistic LiDAR detections using vehicle-level features readily available from traffic simulators. MIDAR achieves an AUC of 0.909 in approximating the detection results generated by CenterPoint on the nuScenes AD dataset.
arXiv Detail & Related papers (2025-08-04T19:35:05Z) - Edge-Based Multimodal Sensor Data Fusion with Vision Language Models (VLMs) for Real-time Autonomous Vehicle Accident Avoidance [12.513296074529727]
This paper proposes the Real-time Edge-based Autonomous Co-pilot Trajectory planner (REACT) for autonomous driving. REACT is a V2X-integrated trajectory optimization framework for AD based on a fine-tuned lightweight Vision-Language Model (VLM). Evaluated on the DeepAccident benchmark, REACT achieves state-of-the-art performance: a 77% collision rate reduction, a 48.2% Video Panoptic Quality (VPQ), and a 0.57-second inference latency on the Jetson AGX Orin.
arXiv Detail & Related papers (2025-08-01T20:16:04Z) - NRSeg: Noise-Resilient Learning for BEV Semantic Segmentation via Driving World Models [24.239522252881336]
Bird's Eye View (BEV) semantic segmentation is an indispensable perception task in end-to-end autonomous driving systems. Unsupervised and semi-supervised learning for BEV tasks underperform due to the homogeneous distribution of the labeled data. This paper proposes NRSeg, a noise-resilient learning framework for BEV semantic segmentation.
arXiv Detail & Related papers (2025-07-05T11:05:43Z) - Task-Oriented Semantic Communication in Large Multimodal Models-based Vehicle Networks [55.32199894495722]
We investigate an LMM-based vehicle AI assistant using a Large Language and Vision Assistant (LLaVA). To reduce computational demands and shorten response time, we optimize LLaVA's image slicing to selectively focus on areas of utmost interest to users. We construct a Visual Question Answering (VQA) dataset for traffic scenarios to evaluate effectiveness.
arXiv Detail & Related papers (2025-05-05T07:18:47Z) - Resource-Efficient Beam Prediction in mmWave Communications with Multimodal Realistic Simulation Framework [57.994965436344195]
Beamforming is a key technology in millimeter-wave (mmWave) communications that improves signal transmission by optimizing directionality and intensity. Multimodal sensing-aided beam prediction has gained significant attention, using various sensing data to predict user locations or network conditions. Despite its promising potential, the adoption of multimodal sensing-aided beam prediction is hindered by high computational complexity, high costs, and limited datasets.
arXiv Detail & Related papers (2025-04-07T15:38:25Z) - Pruning-Based TinyML Optimization of Machine Learning Models for Anomaly Detection in Electric Vehicle Charging Infrastructure [8.29566258132752]
This paper investigates a pruning method for anomaly detection in resource-constrained environments, specifically targeting EVCI. Optimized models achieved significant reductions in model size and inference times, with only a marginal impact on their performance. Notably, our findings indicate that, in the context of EVCI, pruning and FS can enhance computational efficiency while retaining critical anomaly detection capabilities.
arXiv Detail & Related papers (2025-03-19T00:18:37Z) - SimBEV: A Synthetic Multi-Task Multi-Sensor Driving Data Generation Tool and Dataset [101.51012770913627]
Bird's-eye view (BEV) perception has garnered significant attention in autonomous driving in recent years. SimBEV is a randomized synthetic data generation tool that is extensively configurable and scalable. SimBEV is used to create the SimBEV dataset, a large collection of annotated perception data from diverse driving scenarios.
arXiv Detail & Related papers (2025-02-04T00:00:06Z) - Drivetrain simulation using variational autoencoders [0.0]
This work proposes variational autoencoders (VAEs) to predict a vehicle's jerk signals from torque demand. We implement both unconditional and conditional VAEs, trained on experimental data from two variants of a fully electric SUV.
arXiv Detail & Related papers (2025-01-29T13:37:32Z) - OPUS: Occupancy Prediction Using a Sparse Set [64.60854562502523]
We present a framework to simultaneously predict occupied locations and classes using a set of learnable queries.
OPUS incorporates a suite of non-trivial strategies to enhance model performance.
Our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at near 2x FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.
arXiv Detail & Related papers (2024-09-14T07:44:22Z) - Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving [55.93813178692077]
We present RoboBEV, an extensive benchmark suite designed to evaluate the resilience of BEV algorithms. We assess 33 state-of-the-art BEV-based perception models spanning tasks like detection, map segmentation, depth estimation, and occupancy prediction. Our experimental results also underline the efficacy of strategies like pre-training and depth-free BEV transformations in enhancing robustness against out-of-distribution data.
arXiv Detail & Related papers (2024-05-27T17:59:39Z) - Cycle and Semantic Consistent Adversarial Domain Adaptation for Reducing Simulation-to-Real Domain Shift in LiDAR Bird's Eye View [110.83289076967895]
We present a BEV domain adaptation method based on CycleGAN that uses prior semantic classification in order to preserve the information of small objects of interest during the domain adaptation process.
The quality of the generated BEVs has been evaluated using a state-of-the-art 3D object detection framework on the KITTI 3D Object Detection Benchmark.
arXiv Detail & Related papers (2021-04-22T12:47:37Z)