Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning
- URL: http://arxiv.org/abs/2502.14917v1
- Date: Wed, 19 Feb 2025 09:50:44 GMT
- Title: Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning
- Authors: Rui Zhao, Qirui Yuan, Jinyu Li, Haofeng Hu, Yun Li, Chengyuan Zheng, Fei Gao
- Abstract summary: We introduce Sce2DriveX, a human-like driving chain-of-thought (CoT) reasoning MLLM framework. It reconstructs the implicit cognitive chain inherent in human driving, covering scene understanding, meta-action reasoning, behavior interpretation analysis, motion planning and control. It achieves state-of-the-art performance from scene understanding to end-to-end driving, as well as robust generalization on the CARLA Bench2Drive benchmark.
- Score: 24.511628941825116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end autonomous driving, which directly maps raw sensor inputs to low-level vehicle controls, is an important part of Embodied AI. Despite successes in applying Multimodal Large Language Models (MLLMs) for high-level traffic scene semantic understanding, it remains challenging to effectively translate this conceptual semantic understanding into low-level motion control commands and to achieve generalization and consensus in cross-scene driving. We introduce Sce2DriveX, a human-like driving chain-of-thought (CoT) reasoning MLLM framework. Sce2DriveX utilizes multimodal joint learning from local scene videos and global BEV maps to deeply understand long-range spatiotemporal relationships and road topology, enhancing its comprehensive perception and reasoning capabilities in 3D dynamic/static scenes and achieving driving generalization across scenes. Building on this, it reconstructs the implicit cognitive chain inherent in human driving, covering scene understanding, meta-action reasoning, behavior interpretation analysis, motion planning and control, thereby further bridging the gap between autonomous driving and human thought processes. To elevate model performance, we have developed the first extensive Visual Question Answering (VQA) driving instruction dataset tailored for 3D spatial understanding and long-axis task reasoning. Extensive experiments demonstrate that Sce2DriveX achieves state-of-the-art performance from scene understanding to end-to-end driving, as well as robust generalization on the CARLA Bench2Drive benchmark.
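To make the cognitive chain concrete, here is a minimal Python sketch of how the reasoning stages named in the abstract could be serialized into a single chain-of-thought instruction for an MLLM. The stage list follows the abstract (split into five steps for illustration); the prompt wording and the `build_cot_prompt` helper are assumptions, not the paper's implementation.

```python
# Minimal sketch, not the authors' code: the driving CoT stages from the
# abstract composed into one instruction for an MLLM. Wording is assumed.
COT_STAGES = [
    "scene understanding",      # dynamic/static elements, road topology
    "meta-action reasoning",    # e.g. keep lane, yield, turn left
    "behavior interpretation",  # justify the chosen meta-action
    "motion planning",          # short-horizon trajectory
    "control",                  # low-level commands: steering, throttle, brake
]

def build_cot_prompt(scene_caption: str) -> str:
    """Compose one instruction that walks the model through every stage."""
    steps = "\n".join(f"{i}. {stage}" for i, stage in enumerate(COT_STAGES, 1))
    return (
        "You are a driving agent. The local scene video and global BEV map "
        f"are summarized as: {scene_caption}\n"
        f"Reason step by step through:\n{steps}\n"
        "End with a control command (steering, throttle, brake)."
    )

if __name__ == "__main__":
    print(build_cot_prompt("urban intersection, pedestrian crossing from the right"))
```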
Related papers
- OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning [68.45848423501927]
We propose a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning.
Our approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions.
arXiv Detail & Related papers (2025-04-06T03:54:21Z) - OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model [24.90085777003393]
We present OpenDriveVLA, a Vision-Language Action (VLA) model designed for end-to-end autonomous driving.
OpenDriveVLA builds upon open-source pre-trained large Vision-Language Models (VLMs) to generate reliable driving actions.
arXiv Detail & Related papers (2025-03-30T14:45:54Z) - InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving [3.8737986316149775]
We propose a novel end-to-end autonomous driving method called InsightDrive.
It organizes perception by language-guided scene representation.
In experiments, InsightDrive achieves state-of-the-art performance in end-to-end autonomous driving.
arXiv Detail & Related papers (2025-03-17T10:52:32Z) - SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment [15.223886922912842]
We propose a model that can handle three different tasks: (1) closed-loop driving, (2) vision-language understanding, and (3) language-action alignment.
Our model SimLingo is based on a vision language model (VLM) and works using camera input only, excluding expensive sensors like LiDAR.
arXiv Detail & Related papers (2025-03-12T17:58:06Z) - Towards Intelligent Transportation with Pedestrians and Vehicles In-the-Loop: A Surveillance Video-Assisted Federated Digital Twin Framework [62.47416496137193]
We propose a surveillance video assisted federated digital twin (SV-FDT) framework to empower ITSs with pedestrians and vehicles in-the-loop.
The architecture consists of three layers: (i) the end layer, which collects traffic surveillance videos from multiple sources; (ii) the edge layer, responsible for semantic segmentation-based visual understanding, twin agent-based interaction modeling, and local digital twin system (LDTS) creation in local regions; and (iii) the cloud layer, which integrates LDTSs across different regions to construct a global DT model in realtime.
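As a rough illustration of this end/edge/cloud split, here is a minimal Python sketch; all class names and fields are assumptions made for illustration, not the SV-FDT implementation.

```python
# Minimal sketch, assuming the end/edge/cloud layering summarized above.
# All names and fields are illustrative, not taken from the paper.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class EndLayer:
    """Collects traffic surveillance videos from multiple sources."""
    video_streams: List[str] = field(default_factory=list)   # e.g. camera URIs

@dataclass
class LocalDigitalTwin:
    """Edge-layer output for one region: segmentation-based understanding plus twin agents."""
    region_id: str
    semantic_masks: List[str] = field(default_factory=list)
    twin_agents: List[str] = field(default_factory=list)     # modeled pedestrians/vehicles

@dataclass
class CloudLayer:
    """Integrates local digital twins from all regions into a global model."""
    local_twins: List[LocalDigitalTwin] = field(default_factory=list)

    def global_twin(self) -> Dict[str, int]:
        # toy aggregation: number of twin agents per region
        return {t.region_id: len(t.twin_agents) for t in self.local_twins}
```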
arXiv Detail & Related papers (2025-03-06T07:36:06Z) - Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving [2.0122032639916485]
We analyze effective knowledge distillation of semantic labels to smaller vision networks. This can be used for the semantic representation of complex scenes for downstream decision-making in planning and control.
arXiv Detail & Related papers (2025-01-12T01:31:07Z) - Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention [61.3281618482513]
We present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos. CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the dimensions. CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos.
arXiv Detail & Related papers (2024-12-04T18:02:49Z) - DualAD: Dual-Layer Planning for Reasoning in Autonomous Driving [1.8434042562191815]
We present a novel autonomous driving framework, DualAD, designed to imitate human reasoning during driving. DualAD comprises two layers: a rule-based motion planner at the bottom layer that handles routine driving tasks requiring minimal reasoning, and an upper layer featuring a rule-based text encoder.
arXiv Detail & Related papers (2024-09-26T16:58:04Z) - Probing Multimodal LLMs as World Models for Driving [72.18727651074563]
We look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving.
Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored.
arXiv Detail & Related papers (2024-05-09T17:52:42Z) - OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning [68.45848423501927]
We propose a holistic framework for strong alignment between agent models and 3D driving tasks.
Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D.
We propose OmniDrive-nuScenes, a new visual question-answering dataset challenging the true 3D situational awareness of a model.
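The sparse-query lifting mentioned above can be pictured as a small set of learnable queries cross-attending to multi-view image features and compressing them into a fixed-size token set for the LLM. The PyTorch sketch below is only an approximation of that idea: it omits 3D position encoding, and all dimensions and module names are assumptions rather than the paper's architecture.

```python
# Minimal sketch, assuming a query-based lifting/compression module:
# learnable sparse queries attend over flattened multi-view features.
import torch
import torch.nn as nn

class SparseQueryLifter(nn.Module):
    def __init__(self, num_queries=256, dim=768, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)   # project into the LLM embedding space

    def forward(self, image_feats):
        # image_feats: (batch, num_views * H * W, dim) flattened multi-view features
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        lifted, _ = self.cross_attn(q, image_feats, image_feats)
        return self.proj(lifted)          # (batch, num_queries, dim) tokens for the LLM

feats = torch.randn(2, 6 * 16 * 16, 768)  # e.g. 6 camera views with 16x16 feature maps
tokens = SparseQueryLifter()(feats)
print(tokens.shape)                       # torch.Size([2, 256, 768])
```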
arXiv Detail & Related papers (2024-05-02T17:59:24Z) - On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving [37.617793990547625]
This report provides an exhaustive evaluation of the latest state-of-the-art VLM, GPT-4V.
We explore the model's abilities to understand and reason about driving scenes, make decisions, and ultimately act in the capacity of a driver.
Our findings reveal that GPT-4V demonstrates superior performance in scene understanding and causal reasoning compared to existing autonomous systems.
arXiv Detail & Related papers (2023-11-09T12:58:37Z) - OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping [84.65114565766596]
We present OpenLane-V2, the first dataset on topology reasoning for traffic scene structure.
OpenLane-V2 consists of 2,000 annotated road scenes that describe traffic elements and their correlation to the lanes.
We evaluate various state-of-the-art methods, and present their quantitative and qualitative results on OpenLane-V2 to indicate future avenues for investigating topology reasoning in traffic scenes.
arXiv Detail & Related papers (2023-04-20T16:31:22Z) - Graph-based Topology Reasoning for Driving Scenes [102.35885039110057]
We present TopoNet, the first end-to-end framework capable of abstracting traffic knowledge beyond conventional perception tasks.
We evaluate TopoNet on the challenging scene understanding benchmark, OpenLane-V2.
arXiv Detail & Related papers (2023-04-11T15:23:29Z) - Policy Pre-training for End-to-end Autonomous Driving via Self-supervised Geometric Modeling [96.31941517446859]
We propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for policy pretraining in visuomotor driving.
We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos.
In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input.
In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only.
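The photometric error in this pipeline corresponds to the standard self-supervised depth/ego-motion formulation: warp the neighboring frame into the current view using predicted depth and pose, then penalize the per-pixel difference. A minimal PyTorch sketch follows; the intrinsics `K`, the plain L1 penalty, and the tensor shapes are assumptions, and the paper's exact objective may add components such as SSIM or masking.

```python
# Minimal sketch, assuming the standard monocular photometric objective.
import torch
import torch.nn.functional as F

def warp_source(src, depth, pose, K):
    """Inverse-warp `src` (B,3,H,W) into the target view.

    depth: (B,1,H,W) predicted target-view depth
    pose:  (B,4,4) target-to-source camera transform
    K:     (B,3,3) camera intrinsics
    """
    B, _, H, W = src.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=src.device),
        torch.arange(W, device=src.device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()   # (3,H,W)
    pix = pix.reshape(1, 3, -1).expand(B, -1, -1)                      # (B,3,H*W)
    cam = torch.linalg.inv(K) @ pix * depth.reshape(B, 1, -1)          # back-project
    cam = torch.cat([cam, torch.ones(B, 1, H * W, device=src.device)], dim=1)
    src_pix = K @ (pose @ cam)[:, :3]                                  # project into source view
    src_pix = src_pix[:, :2] / src_pix[:, 2:].clamp(min=1e-6)
    grid = src_pix.reshape(B, 2, H, W).permute(0, 2, 3, 1)             # (B,H,W,2)
    norm = torch.tensor([W - 1, H - 1], dtype=grid.dtype, device=grid.device)
    grid = grid / norm * 2 - 1                                         # normalize to [-1,1]
    return F.grid_sample(src, grid, align_corners=True)

def photometric_loss(target, src, depth, pose, K):
    return (target - warp_source(src, depth, pose, K)).abs().mean()    # L1 photometric error
```

In the two-stage setup described above, the depth and pose would come from the stage-one networks, while stage two retains the ego-motion-predicting visual encoder as the driving-policy representation.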
arXiv Detail & Related papers (2023-01-03T08:52:49Z)