Related papers: 2nd Place Solution for CVPR2024 E2E Challenge: End-to-End Autonomous Driving Using Vision Language Model

URL: http://arxiv.org/abs/2509.02659v1
Date: Tue, 02 Sep 2025 17:52:29 GMT
Title: 2nd Place Solution for CVPR2024 E2E Challenge: End-to-End Autonomous Driving Using Vision Language Model
Authors: Zilong Guo, Yi Luo, Long Sha, Dongxu Wang, Panqu Wang, Chenyang Xu, Yi Yang
Abstract summary: We show that combining end-to-end architectural design and knowledgeable VLMs yields impressive performance on the driving tasks. It is worth noting that our method only uses a single camera and is the best camera-only solution across the leaderboard.
Score: 21.811872482011534
License: http://creativecommons.org/licenses/by/4.0/
Abstract: End-to-end autonomous driving has drawn tremendous attention recently. Many works focus on using modular deep neural networks to construct the end-to-end architecture. However, whether using powerful large language models (LLMs), especially multi-modality Vision Language Models (VLMs), could benefit the end-to-end driving tasks remains a question. In our work, we demonstrate that combining end-to-end architectural design and knowledgeable VLMs yields impressive performance on the driving tasks. It is worth noting that our method only uses a single camera and is the best camera-only solution across the leaderboard, demonstrating the effectiveness of a vision-based driving approach and its potential for end-to-end driving tasks.

Related papers

LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving [58.535516533697425]
Large vision-language models (VLMs) have shown promising capabilities in scene understanding. We propose a novel vision-language framework tailored for autonomous driving, called LMAD. Our framework emulates modern end-to-end driving paradigms by incorporating comprehensive scene understanding and a task-specialized structure with VLMs.
arXiv Detail & Related papers (2025-08-17T15:42:54Z)
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving [42.87581214382647]
We propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is able to handle diverse scenarios without suffering from modes averaging like existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks.
arXiv Detail & Related papers (2025-05-22T06:23:04Z)
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning [68.45848423501927]
We propose a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. Our approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions.
arXiv Detail & Related papers (2025-04-06T03:54:21Z)
SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment [15.223886922912842]
We propose a model that can handle three different tasks: (1) closed-loop driving, (2) vision-language understanding, and (3) language-action alignment. Our model SimLingo is based on a vision language model (VLM) and works using only a camera, excluding expensive sensors like LiDAR.
arXiv Detail & Related papers (2025-03-12T17:58:06Z)
EMMA: End-to-End Multimodal Model for Autonomous Driving [56.972452552944056]
We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built on a multi-modal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs.
arXiv Detail & Related papers (2024-10-30T17:46:31Z)
MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving [11.045411890043919]
Vision-language models (VLMs) serve as general-purpose end-to-end models in autonomous driving. Most existing methods rely on computationally expensive visual encoders and large language models (LLMs). We propose a novel framework called MiniDrive, which incorporates our proposed Feature Engineering Mixture of Experts (FE-MoE) module and Dynamic Instruction Adapter (DI-Adapter).
arXiv Detail & Related papers (2024-09-11T13:43:01Z)
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning [68.45848423501927]
We propose a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. Our approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions.
arXiv Detail & Related papers (2024-05-02T17:59:24Z)
Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving [56.381918362410175]
Drive-WM is the first driving world model compatible with existing end-to-end planning models. Our model generates high-fidelity multiview videos in driving scenes.
arXiv Detail & Related papers (2023-11-29T18:59:47Z)
Planning-oriented Autonomous Driving [60.93767791255728]
We argue that a favorable framework should be devised and optimized in pursuit of the ultimate goal, i.e., planning of the self-driving car. We introduce Unified Autonomous Driving (UniAD), a comprehensive framework that incorporates full-stack driving tasks in one network.
arXiv Detail & Related papers (2022-12-20T10:47:53Z)
CERBERUS: Simple and Effective All-In-One Automotive Perception Model with Multi Task Learning [4.622165486890318]
In-vehicle embedded computing platforms cannot cope with the computational effort required to run a heavy model for each individual task. We present CERBERUS, a lightweight model that leverages a multitask-learning approach to enable the execution of multiple perception tasks at the cost of a single inference.
arXiv Detail & Related papers (2022-10-03T08:17:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.