2nd Place Solution for CVPR2024 E2E Challenge: End-to-End Autonomous Driving Using Vision Language Model
- URL: http://arxiv.org/abs/2509.02659v1
- Date: Tue, 02 Sep 2025 17:52:29 GMT
- Title: 2nd Place Solution for CVPR2024 E2E Challenge: End-to-End Autonomous Driving Using Vision Language Model
- Authors: Zilong Guo, Yi Luo, Long Sha, Dongxu Wang, Panqu Wang, Chenyang Xu, Yi Yang,
- Abstract summary: We show that combining end-to-end architectural design and knowledgeable VLMs yields impressive performance on the driving tasks. It is worth noting that our method only uses a single camera and is the best camera-only solution across the leaderboard.
- Score: 21.811872482011534
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: End-to-end autonomous driving has drawn tremendous attention recently. Many works focus on using modular deep neural networks to construct the end-to-end architecture. However, whether powerful large language models (LLMs), especially multi-modal Vision Language Models (VLMs), could benefit end-to-end driving tasks remains an open question. In our work, we demonstrate that combining end-to-end architectural design and knowledgeable VLMs yields impressive performance on the driving tasks. It is worth noting that our method only uses a single camera and is the best camera-only solution across the leaderboard, demonstrating the effectiveness of the vision-based driving approach and its potential for end-to-end driving tasks.
Related papers
- LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving [58.535516533697425]
Large vision-language models (VLMs) have shown promising capabilities in scene understanding. We propose a novel vision-language framework tailored for autonomous driving, called LMAD. Our framework emulates modern end-to-end driving paradigms by incorporating comprehensive scene understanding and a task-specialized structure with VLMs.
arXiv Detail & Related papers (2025-08-17T15:42:54Z)
- DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving [42.87581214382647]
We propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is able to handle diverse scenarios without suffering from mode averaging like existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks.
arXiv Detail & Related papers (2025-05-22T06:23:04Z)
- OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning [68.45848423501927]
We propose a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. Our approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions.
arXiv Detail & Related papers (2025-04-06T03:54:21Z)
- SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment [15.223886922912842]
We propose a model that can handle three different tasks: (1) closed-loop driving, (2) vision-language understanding, and (3) language-action alignment. Our model SimLingo is based on a vision language model (VLM) and works using only cameras, excluding expensive sensors like LiDAR.
arXiv Detail & Related papers (2025-03-12T17:58:06Z)
- EMMA: End-to-End Multimodal Model for Autonomous Driving [56.972452552944056]
We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving.
Built on a multi-modal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs.
arXiv Detail & Related papers (2024-10-30T17:46:31Z)
- MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving [11.045411890043919]
Vision-language models (VLMs) serve as general-purpose end-to-end models in autonomous driving. Most existing methods rely on computationally expensive visual encoders and large language models (LLMs). We propose a novel framework called MiniDrive, which incorporates our proposed Feature Engineering Mixture of Experts (FE-MoE) module and Dynamic Instruction Adapter (DI-Adapter).
arXiv Detail & Related papers (2024-09-11T13:43:01Z)
- OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning [68.45848423501927]
We propose a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. Our approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions.
arXiv Detail & Related papers (2024-05-02T17:59:24Z)
- Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving [56.381918362410175]
Drive-WM is the first driving world model compatible with existing end-to-end planning models.
Our model generates high-fidelity multiview videos in driving scenes.
arXiv Detail & Related papers (2023-11-29T18:59:47Z)
- Planning-oriented Autonomous Driving [60.93767791255728]
We argue that a favorable framework should be devised and optimized in pursuit of the ultimate goal, i.e., planning of the self-driving car.
We introduce Unified Autonomous Driving (UniAD), a comprehensive framework that incorporates full-stack driving tasks in one network.
arXiv Detail & Related papers (2022-12-20T10:47:53Z)
- CERBERUS: Simple and Effective All-In-One Automotive Perception Model with Multi Task Learning [4.622165486890318]
In-vehicle embedded computing platforms cannot cope with the computational effort required to run a heavy model for each individual task.
We present CERBERUS, a lightweight model that leverages a multitask-learning approach to enable the execution of multiple perception tasks at the cost of a single inference.
arXiv Detail & Related papers (2022-10-03T08:17:26Z)