Addressing Corner Cases in Autonomous Driving: A World Model-based Approach with Mixture of Experts and LLMs
- URL: http://arxiv.org/abs/2510.21867v1
- Date: Thu, 23 Oct 2025 11:41:51 GMT
- Title: Addressing Corner Cases in Autonomous Driving: A World Model-based Approach with Mixture of Experts and LLMs
- Authors: Haicheng Liao, Bonan Wang, Junxian Yang, Chengyue Wang, Zhengbin He, Guohui Zhang, Chengzhong Xu, Zhenning Li,
- Abstract summary: We present WM-MoE, the first world model-based motion forecasting framework.<n>It unifies perception, temporal memory, and decision making to address the challenges of high-risk corner-case scenarios.<n> WM-MoE consistently outperforms state-of-the-art (SOTA) baselines and remains robust under corner-case and data-missing conditions.
- Score: 30.363301425068162
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate and reliable motion forecasting is essential for the safe deployment of autonomous vehicles (AVs), particularly in rare but safety-critical scenarios known as corner cases. Existing models often underperform in these situations due to an over-representation of common scenes in training data and limited generalization capabilities. To address this limitation, we present WM-MoE, the first world model-based motion forecasting framework that unifies perception, temporal memory, and decision making to address the challenges of high-risk corner-case scenarios. The model constructs a compact scene representation that explains current observations, anticipates future dynamics, and evaluates the outcomes of potential actions. To enhance long-horizon reasoning, we leverage large language models (LLMs) and introduce a lightweight temporal tokenizer that maps agent trajectories and contextual cues into the LLM's feature space without additional training, enriching temporal context and commonsense priors. Furthermore, a mixture-of-experts (MoE) is introduced to decompose complex corner cases into subproblems and allocate capacity across scenario types, and a router assigns scenes to specialized experts that infer agent intent and perform counterfactual rollouts. In addition, we introduce nuScenes-corner, a new benchmark that comprises four real-world corner-case scenarios for rigorous evaluation. Extensive experiments on four benchmark datasets (nuScenes, NGSIM, HighD, and MoCAD) showcase that WM-MoE consistently outperforms state-of-the-art (SOTA) baselines and remains robust under corner-case and data-missing conditions, indicating the promise of world model-based architectures for robust and generalizable motion forecasting in fully AVs.
Related papers
- It's TIME: Towards the Next Generation of Time Series Forecasting Benchmarks [87.7937890373758]
Time series foundation models (TSFMs) are revolutionizing the forecasting landscape from specific dataset modeling to generalizable task evaluation.<n>We introduce TIME, a next-generation task-centric benchmark comprising 50 fresh datasets and 98 forecasting tasks.<n>We propose a novel pattern-level evaluation perspective that moves beyond traditional dataset-level evaluations based on static meta labels.
arXiv Detail & Related papers (2026-02-12T16:31:01Z) - CogRail: Benchmarking VLMs in Cognitive Intrusion Perception for Intelligent Railway Transportation Systems [29.385460126069386]
We introduce a novel benchmark, CogRail, which integrates curated datasets with cognitively driven question-answer annotations.<n>Building upon this benchmark, we conduct a systematic evaluation of state-of-the-art visual-language models.<n>We propose a joint fine-tuning framework that integrates three core tasks, position perception, movement prediction, and threat analysis.
arXiv Detail & Related papers (2026-01-14T16:36:26Z) - Towards Safer Mobile Agents: Scalable Generation and Evaluation of Diverse Scenarios for VLMs [10.48956192789531]
Vision Language Models (VLMs) are increasingly deployed in autonomous vehicles and mobile systems.<n>Current benchmarks inadequately cover diverse hazardous situations, especially anomalous scenarios with.<n>temporal dynamics.<n>We introduce textbfHazardForge, a scalable pipeline that leverages image editing models to generate.<n>scenarios with layout decision algorithms, and validation modules.
arXiv Detail & Related papers (2026-01-13T11:55:31Z) - Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method [54.461213497603154]
Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities.<n>Nuplan-Occ is the largest occupancy dataset to date, constructed from the widely used Nuplan benchmark.<n>We develop a unified framework that jointly synthesizes high-quality occupancy, multi-view videos, and LiDAR point clouds.
arXiv Detail & Related papers (2025-10-27T03:52:45Z) - Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation [54.3628937181904]
Internal world models (WMs) enable agents to understand the world's state and predict transitions.<n>Recent large Vision-Language Models (VLMs), such as OpenAI o3, GPT-4o and Gemini, exhibit potential as general-purpose WMs.
arXiv Detail & Related papers (2025-06-27T03:24:29Z) - INSIGHT: Enhancing Autonomous Driving Safety through Vision-Language Models on Context-Aware Hazard Detection and Edge Case Evaluation [7.362380225654904]
INSIGHT is a hierarchical vision-language model (VLM) framework designed to enhance hazard detection and edge-case evaluation.<n>By using multimodal data fusion, our approach integrates semantic and visual representations, enabling precise interpretation of driving scenarios.<n> Experimental results on the BDD100K dataset demonstrate a substantial improvement in hazard prediction straightforwardness and accuracy over existing models.
arXiv Detail & Related papers (2025-02-01T01:43:53Z) - Realistic Corner Case Generation for Autonomous Vehicles with Multimodal Large Language Model [10.741225574706]
AutoScenario is a framework for realistic corner case generation.<n>It converts safety-critical real-world data from multiple sources into textual representations.<n>It integrates tools from the Simulation of Urban Mobility (SUMO) and CARLA simulators.
arXiv Detail & Related papers (2024-11-29T20:23:28Z) - Generating Out-Of-Distribution Scenarios Using Language Models [58.47597351184034]
Large Language Models (LLMs) have shown promise in autonomous driving.
This paper introduces a framework for generating diverse Out-Of-Distribution (OOD) driving scenarios.
We evaluate our framework through extensive simulations and introduce a new "OOD-ness" metric.
arXiv Detail & Related papers (2024-11-25T16:38:17Z) - GraphSCENE: On-Demand Critical Scenario Generation for Autonomous Vehicles in Simulation [11.896059467313668]
This work introduces a novel method that generates dynamic temporal scene graphs corresponding to diverse traffic scenarios, on-demand, tailored to user-defined preferences.<n>A temporal Graph Neural Network (GNN) model learns to predict relationships between ego-vehicle agents and static structures, guided by real-world interaction patterns.<n>We render the predicted scenarios in simulation to further demonstrate their effectiveness as testing environments for AV agents.
arXiv Detail & Related papers (2024-10-17T13:02:06Z) - SAFE-SIM: Safety-Critical Closed-Loop Traffic Simulation with Diffusion-Controllable Adversaries [94.84458417662407]
We introduce SAFE-SIM, a controllable closed-loop safety-critical simulation framework.
Our approach yields two distinct advantages: 1) generating realistic long-tail safety-critical scenarios that closely reflect real-world conditions, and 2) providing controllable adversarial behavior for more comprehensive and interactive evaluations.
We validate our framework empirically using the nuScenes and nuPlan datasets across multiple planners, demonstrating improvements in both realism and controllability.
arXiv Detail & Related papers (2023-12-31T04:14:43Z) - JRDB-Traj: A Dataset and Benchmark for Trajectory Forecasting in Crowds [79.00975648564483]
Trajectory forecasting models, employed in fields such as robotics, autonomous vehicles, and navigation, face challenges in real-world scenarios.
This dataset provides comprehensive data, including the locations of all agents, scene images, and point clouds, all from the robot's perspective.
The objective is to predict the future positions of agents relative to the robot using raw sensory input data.
arXiv Detail & Related papers (2023-11-05T18:59:31Z) - MUSE-VAE: Multi-Scale VAE for Environment-Aware Long Term Trajectory
Prediction [28.438787700968703]
Conditional MUSE offers diverse and simultaneously more accurate predictions compared to the current state-of-the-art.
We demonstrate these assertions through a comprehensive set of experiments on nuScenes and SDD benchmarks as well as PFSD, a new synthetic dataset.
arXiv Detail & Related papers (2022-01-18T18:40:03Z) - Temporal Predictive Coding For Model-Based Planning In Latent Space [80.99554006174093]
We present an information-theoretic approach that employs temporal predictive coding to encode elements in the environment that can be predicted across time.
We evaluate our model on a challenging modification of standard DMControl tasks where the background is replaced with natural videos that contain complex but irrelevant information to the planning task.
arXiv Detail & Related papers (2021-06-14T04:31:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.