ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
- URL: http://arxiv.org/abs/2503.19755v1
- Date: Tue, 25 Mar 2025 15:18:43 GMT
- Title: ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
- Authors: Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, Xiang Bai
- Abstract summary: We propose ORION, a holistic E2E autonomous driving framework by vision-language instructed action generation. Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenging Bench2Drive benchmark.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation due to limited causal reasoning capability. Current methods attempt to leverage the powerful understanding and reasoning abilities of Vision-Language Models (VLMs) to resolve this dilemma. However, it remains an open problem that few VLM-based E2E methods perform well in closed-loop evaluation, owing to the gap between the semantic reasoning space and the purely numerical trajectory outputs of the action space. To tackle this issue, we propose ORION, a holistic E2E autonomous driving framework by vision-language instructed action generation. ORION uniquely combines a QT-Former to aggregate long-term historical context, a Large Language Model (LLM) for driving-scenario reasoning, and a generative planner for precise trajectory prediction. ORION further aligns the reasoning space and the action space to enable unified E2E optimization for both visual question-answering (VQA) and planning tasks. Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenging Bench2Drive benchmark, outperforming state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 19.61% SR.
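To make the described pipeline concrete, below is a minimal PyTorch sketch of the three-component design the abstract names: a QT-Former for history aggregation, an LLM for reasoning, and a generative planner for trajectories. All module internals, dimensions, and the stand-in for the LLM's planning token are assumptions for illustration, not the authors' implementation.

```python
# A minimal, illustrative sketch of an ORION-style pipeline. Every module
# internal and dimension here is an assumption, not the paper's design.
import torch
import torch.nn as nn

class QTFormer(nn.Module):
    """Hypothetical query-based transformer that compresses long-term
    history features into a fixed set of context queries."""
    def __init__(self, dim=256, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, history_feats):  # (B, T*N, dim)
        q = self.queries.unsqueeze(0).expand(history_feats.size(0), -1, -1)
        ctx, _ = self.attn(q, history_feats, history_feats)
        return ctx  # (B, num_queries, dim)

class GenerativePlanner(nn.Module):
    """Hypothetical generative head: maps a latent sample, conditioned on
    the LLM's planning token, to a future trajectory (waypoints)."""
    def __init__(self, dim=256, horizon=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2, dim), nn.ReLU(), nn.Linear(dim, horizon * 2))
        self.horizon = horizon

    def forward(self, plan_token):  # (B, dim)
        z = torch.randn_like(plan_token)           # latent sample
        out = self.net(torch.cat([plan_token, z], dim=-1))
        return out.view(-1, self.horizon, 2)       # (B, horizon, xy)

# Toy forward pass: vision features -> QT-Former context -> (an LLM would
# reason here and emit a planning token) -> generative planner -> trajectory.
B, dim = 2, 256
history_feats = torch.randn(B, 4 * 64, dim)        # 4 frames x 64 tokens
ctx = QTFormer(dim)(history_feats)
plan_token = ctx.mean(dim=1)                       # stand-in for an LLM output
traj = GenerativePlanner(dim)(plan_token)
print(traj.shape)  # torch.Size([2, 8, 2])
```

In this reading, aligning the reasoning and action spaces amounts to training the LLM's planning token and the generative head jointly, so VQA supervision and trajectory supervision flow through shared representations.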
Related papers
- Two Tasks, One Goal: Uniting Motion and Planning for Excellent End To End Autonomous Driving Performance [14.665143402317685]
Previous end-to-end autonomous driving approaches often decouple planning and motion tasks, treating them as separate modules.
We propose TTOG, a novel two-stage trajectory generation framework.
In the first stage, a diverse set of trajectory candidates is generated, while the second stage focuses on refining these candidates through vehicle state information.
To mitigate the issue of unavailable surrounding-vehicle states, TTOG employs a state estimator trained on ego-vehicle data and subsequently extended to other vehicles. A toy sketch of the two-stage idea follows.
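```python
# A toy sketch of a TTOG-style two-stage trajectory pipeline, under the
# assumptions stated in the summary above: stage one proposes diverse
# candidates, stage two refines them using (estimated) vehicle states.
# Names, shapes, and constants are illustrative, not from the paper.
import numpy as np

def propose_candidates(num=16, horizon=8):
    """Stage 1: sample diverse candidate trajectories (here: simple arcs)."""
    headings = np.linspace(-0.6, 0.6, num)           # assumed steering spread
    t = np.arange(1, horizon + 1)[None, :]           # time steps
    x = t * np.cos(headings[:, None] * t * 0.1)
    y = t * np.sin(headings[:, None] * t * 0.1)
    return np.stack([x, y], axis=-1)                 # (num, horizon, 2)

def estimate_state(observed_pos):
    """Stand-in for the ego-data-trained state estimator the summary says is
    extended to other vehicles: finite-difference velocity."""
    return observed_pos[-1] - observed_pos[-2]       # crude velocity estimate

def refine(candidates, ego_vel):
    """Stage 2: shift candidates toward consistency with the estimated
    velocity (a stand-in for the paper's learned refinement)."""
    first_step = candidates[:, 0, :]
    correction = (ego_vel - first_step) * 0.5        # assumed blend factor
    return candidates + correction[:, None, :]

observed = np.array([[0.0, 0.0], [0.9, 0.1]])        # two past ego positions
cands = propose_candidates()
refined = refine(cands, estimate_state(observed))
print(refined.shape)  # (16, 8, 2)
```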
arXiv Detail & Related papers (2025-04-17T05:52:35Z)
- RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving [10.984203470464687]
Vision-language models (VLMs) often suffer from limitations such as inadequate spatial perception and hallucination.
We propose a retrieval-augmented decision-making (RAD) framework to enhance VLMs' capabilities to reliably generate meta-actions in autonomous driving scenes.
We fine-tune VLMs on a dataset derived from the NuScenes dataset to enhance their spatial perception and bird's-eye view image comprehension capabilities.
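Below is a hypothetical sketch of the retrieval-augmented decision idea: embed the current scene, retrieve the most similar stored scenes, and let their recorded meta-actions inform the choice. The memory contents, embeddings, and selection rule are invented for illustration; in a real system the retrieved examples would condition the fine-tuned VLM rather than decide directly.

```python
# Hypothetical retrieval-augmented meta-action selection in the spirit of
# RAD. The database, embeddings, and voting rule are illustrative only.
import numpy as np

MEMORY = {  # assumed retrieval database: scene name -> (embedding, action)
    "slow_lead_vehicle": (np.array([0.9, 0.1, 0.0]), "DECELERATE"),
    "clear_highway":     (np.array([0.0, 0.9, 0.1]), "KEEP_SPEED"),
    "merging_gap_left":  (np.array([0.1, 0.2, 0.9]), "LANE_CHANGE_LEFT"),
}

def retrieve(query, k=2):
    """Return the k most similar memory entries by cosine similarity."""
    scored = []
    for name, (emb, action) in MEMORY.items():
        sim = emb @ query / (np.linalg.norm(emb) * np.linalg.norm(query))
        scored.append((sim, name, action))
    return sorted(scored, reverse=True)[:k]

def decide(query_embedding):
    """Pick the meta-action of the best-matching retrieved scene; a real
    system would instead feed the retrieved examples into the VLM prompt."""
    top = retrieve(query_embedding)
    return top[0][2], top

action, evidence = decide(np.array([0.8, 0.2, 0.1]))
print(action)  # DECELERATE (nearest neighbour: slow_lead_vehicle)
```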
arXiv Detail & Related papers (2025-03-18T03:25:57Z)
- DiffAD: A Unified Diffusion Modeling Approach for Autonomous Driving [17.939192289319056]
We introduce DiffAD, a novel diffusion probabilistic model that redefines autonomous driving as a conditional image generation task. By rasterizing heterogeneous targets onto a unified bird's-eye view (BEV) and modeling their latent distribution, DiffAD unifies various driving objectives. The reverse process iteratively refines the generated BEV image, resulting in more robust and realistic driving behaviors.
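To ground the diffusion formulation, here is a compact reverse-diffusion loop that iteratively denoises a BEV "plan image", mirroring the iterative refinement described above. The denoiser is an untrained stand-in and the schedule is a generic DDPM-style assumption, not DiffAD's actual configuration.

```python
# Generic DDPM-style reverse process over a toy BEV image; illustrative only.
import torch
import torch.nn as nn

denoiser = nn.Sequential(  # stand-in for DiffAD's learned denoising network
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1))

betas = torch.linspace(1e-4, 0.02, 50)           # assumed noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

x = torch.randn(1, 3, 32, 32)                    # noisy BEV plan image
with torch.no_grad():
    for t in reversed(range(50)):                # reverse (denoising) process
        eps = denoiser(x)                        # predicted noise
        # standard DDPM posterior mean, using the predicted noise
        x = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) \
            / torch.sqrt(alphas[t])
        if t > 0:                                # add noise except at t == 0
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
print(x.shape)  # torch.Size([1, 3, 32, 32]) -- the refined BEV image
```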
arXiv Detail & Related papers (2025-03-15T15:23:35Z)
- DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding [76.3876070043663]
We propose DriveLMM-o1, a dataset and benchmark designed to advance step-wise visual reasoning for autonomous driving.
Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning.
Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model.
arXiv Detail & Related papers (2025-03-13T17:59:01Z)
- DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving [62.62464518137153]
DriveTransformer is a simplified E2E-AD framework designed for ease of scaling up. It is composed of three unified operations: task self-attention, sensor cross-attention, and temporal cross-attention. It achieves state-of-the-art performance on both the simulated closed-loop benchmark Bench2Drive and the real-world open-loop benchmark nuScenes, with high FPS.
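The three unified operations lend themselves to a short schematic. In the sketch below, task queries attend to themselves, then to current sensor features, then to cached history features; the dimensions and the ordering of the three steps are assumptions for illustration.

```python
# Schematic block with the three attention operations named in the summary.
import torch
import torch.nn as nn

class UnifiedBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.task_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sensor_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, task_q, sensor_feats, history_feats):
        # 1) task queries exchange information with each other
        task_q = task_q + self.task_self(task_q, task_q, task_q)[0]
        # 2) task queries read from current sensor features
        task_q = task_q + self.sensor_cross(task_q, sensor_feats, sensor_feats)[0]
        # 3) task queries read from past (temporal) features
        task_q = task_q + self.temporal_cross(task_q, history_feats, history_feats)[0]
        return task_q

block = UnifiedBlock()
out = block(torch.randn(1, 64, 256),   # task queries (detection/map/planning)
            torch.randn(1, 900, 256),  # current multi-view sensor tokens
            torch.randn(1, 256, 256))  # cached history tokens
print(out.shape)  # torch.Size([1, 64, 256])
```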
arXiv Detail & Related papers (2025-03-07T11:41:18Z)
- DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Self-Driving [55.53171248839489]
We propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving. Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction, and an iterative motion planner. Experiments conducted on the nuScenes and Bench2Drive datasets demonstrate the superior planning performance and great efficiency of DiFSD.
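As a loose illustration of iterative motion planning over a sparse set of interactive agents, the toy sketch below repeatedly nudges a coarse ego trajectory away from nearby agents. The selection rule, step size, and geometry are assumptions for illustration only.

```python
# Toy iterative trajectory refinement against sparse agents; illustrative only.
import numpy as np

def iterative_plan(init_traj, agents, iters=3, margin=2.0, step=0.3):
    traj = init_traj.copy()
    for _ in range(iters):                        # iterative refinement loop
        for a in agents:                          # sparse interactive agents
            d = traj - a                          # (horizon, 2) offsets
            dist = np.linalg.norm(d, axis=1, keepdims=True)
            close = dist < margin                 # waypoints too near agent
            # push close waypoints away from the agent along the offset
            traj = traj + np.where(close, step * d / (dist + 1e-6), 0.0)
    return traj

init = np.stack([np.arange(1, 9, dtype=float), np.zeros(8)], axis=1)
agents = [np.array([4.0, 0.5])]                   # one nearby agent
print(iterative_plan(init, agents)[3])            # 4th waypoint pushed away
```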
arXiv Detail & Related papers (2024-09-15T15:55:24Z)
- Making Large Language Models Better Planners with Reasoning-Decision Alignment [70.5381163219608]
We motivate an end-to-end decision-making model based on a multimodality-augmented LLM.
We propose a reasoning-decision alignment constraint between the paired CoTs and planning results.
We dub our proposed large language planners with reasoning-decision alignment as RDA-Driver.
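One plausible reading of a reasoning-decision alignment constraint is a pairwise ranking loss that penalizes disagreement between chain-of-thought (CoT) scores and planner scores over candidate plans. The sketch below implements that reading; the score vectors and margin are illustrative assumptions, not the paper's formulation.

```python
# Hedged sketch of a reasoning-decision alignment loss; illustrative only.
import torch
import torch.nn.functional as F

def alignment_loss(cot_scores, plan_scores, margin=0.1):
    """Hinge-style loss: for every pair of candidates, if the CoT prefers
    candidate i over j, the planner's scores should agree by a margin."""
    diff_cot = cot_scores[:, None] - cot_scores[None, :]     # pairwise prefs
    diff_plan = plan_scores[:, None] - plan_scores[None, :]
    prefer = (diff_cot > 0).float()                          # CoT says i > j
    return (prefer * F.relu(margin - diff_plan)).sum() / prefer.sum()

cot = torch.tensor([2.0, 0.5, 1.0])    # CoT quality scores for 3 candidates
plan = torch.tensor([0.2, 0.9, 0.4])   # planner scores (here: misaligned)
print(alignment_loss(cot, plan))       # positive loss -> pushes agreement
```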
arXiv Detail & Related papers (2024-08-25T16:43:47Z)
- Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving [59.705635382104454]
We present Bench2Drive, the first benchmark for evaluating E2E-AD systems' multiple abilities in a closed-loop manner. We implement state-of-the-art E2E-AD models and evaluate them in Bench2Drive, providing insights regarding the current status and future directions.
arXiv Detail & Related papers (2024-06-06T09:12:30Z)