CarLLaVA: Vision language models for camera-only closed-loop driving
- URL: http://arxiv.org/abs/2406.10165v1
- Date: Fri, 14 Jun 2024 16:35:47 GMT
- Title: CarLLaVA: Vision language models for camera-only closed-loop driving
- Authors: Katrin Renz, Long Chen, Ana-Maria Marcu, Jan Hünermann, Benoit Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, Oleg Sinavski
- Abstract summary: We present CarLLaVA, a Vision Language Model (VLM) for autonomous driving, developed for the CARLA Autonomous Driving Challenge 2.0.
CarLLaVA uses the vision encoder of the LLaVA VLM and the LLaMA architecture as backbone, achieving state-of-the-art closed-loop driving performance.
We show preliminary results on predicting language commentary alongside the driving output.
- Score: 14.852612275631671
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this technical report, we present CarLLaVA, a Vision Language Model (VLM) for autonomous driving, developed for the CARLA Autonomous Driving Challenge 2.0. CarLLaVA uses the vision encoder of the LLaVA VLM and the LLaMA architecture as backbone, achieving state-of-the-art closed-loop driving performance with only camera input and without the need for complex or expensive labels. Additionally, we show preliminary results on predicting language commentary alongside the driving output. CarLLaVA uses a semi-disentangled output representation of both path predictions and waypoints, combining the advantages of the path for better lateral control with those of the waypoints for better longitudinal control. We propose an efficient training recipe to train on large driving datasets without wasting compute on easy, trivial data. CarLLaVA ranks 1st place in the sensor track of the CARLA Autonomous Driving Challenge 2.0, outperforming the previous state of the art by 458% and the best concurrent submission by 32.6%.
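The report itself does not ship code, but the semi-disentangled output described in the abstract can be illustrated with a minimal sketch: one head regresses a geometric path (used for lateral control) and a second head regresses time-conditioned waypoints (used for longitudinal control), both decoded from the backbone's hidden state. Everything below (module names, dimensions, the pooled-feature input) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SemiDisentangledHead(nn.Module):
    """Illustrative sketch (not the authors' code): decode a path for lateral
    control and waypoints for longitudinal control from backbone hidden states."""

    def __init__(self, hidden_dim: int = 1024, n_path: int = 20, n_wp: int = 8):
        super().__init__()
        # Path: n_path (x, y) points sampled along the route, independent of speed.
        self.path_head = nn.Sequential(
            nn.Linear(hidden_dim, 256), nn.GELU(), nn.Linear(256, n_path * 2)
        )
        # Waypoints: n_wp (x, y) points at fixed future time steps, encoding speed.
        self.waypoint_head = nn.Sequential(
            nn.Linear(hidden_dim, 256), nn.GELU(), nn.Linear(256, n_wp * 2)
        )
        self.n_path, self.n_wp = n_path, n_wp

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, hidden_dim) pooled feature from the LLM backbone.
        path = self.path_head(hidden).view(-1, self.n_path, 2)
        waypoints = self.waypoint_head(hidden).view(-1, self.n_wp, 2)
        return path, waypoints

# Example: steering can follow the path, while waypoint spacing carries the speed signal.
head = SemiDisentangledHead()
path, waypoints = head(torch.randn(1, 1024))
target_speed = (waypoints[:, 1:] - waypoints[:, :-1]).norm(dim=-1).mean()  # rough speed proxy
```

The intuition suggested by the abstract is that the path is a purely spatial quantity, so lateral control does not depend on speed, while the time-indexed waypoints retain the longitudinal information.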
Related papers
- SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment [15.223886922912842]
We propose a model that can handle three different tasks: (1) closed-loop driving, (2) vision-language understanding, and (3) language-action alignment.
Our model SimLingo is based on a vision language model (VLM) and uses only camera input, excluding expensive sensors like LiDAR.
arXiv Detail & Related papers (2025-03-12T17:58:06Z)
- BEVDriver: Leveraging BEV Maps in LLMs for Robust Closed-Loop Driving [0.5735035463793009]
Large Language Models (LLMs) possess reasoning capabilities and natural language understanding.
We introduce BEVDriver, an LLM-based model for end-to-end closed-loop driving in CARLA.
On the LangAuto benchmark, our model reaches up to 18.9% higher performance on the Driving Score compared to SoTA methods.
arXiv Detail & Related papers (2025-03-05T00:27:32Z)
- VaViM and VaVAM: Autonomous Driving through Video Generative Modeling [88.33638585518226]
We introduce an open-source auto-regressive video model (VaViM) and its companion video-action model (VaVAM) to investigate how video pre-training transfers to real-world driving.
We evaluate our models in open- and closed-loop driving scenarios, revealing that video-based pre-training holds promise for autonomous driving.
arXiv Detail & Related papers (2025-02-21T18:56:02Z)
- HE-Drive: Human-Like End-to-End Driving with Vision Language Models [11.845309076856365]
We propose HE-Drive: the first human-like-centric end-to-end autonomous driving system.
We show that HE-Drive achieves state-of-the-art performance (i.e., reduces the average collision rate by 71% compared to VAD) and efficiency (i.e., 1.9X faster than SparseDrive) on the evaluated datasets.
arXiv Detail & Related papers (2024-10-07T14:06:16Z)
- End-to-End Autonomous Driving without Costly Modularization and 3D Manual Annotation [34.070813293944944]
We propose UAD, a method for vision-based end-to-end autonomous driving (E2EAD).
Our motivation stems from the observation that current E2EAD models still mimic the modular architecture in typical driving stacks.
Our UAD achieves a 38.7% relative improvement over UniAD in average collision rate on nuScenes and surpasses VAD by 41.32 points on the driving score in CARLA's Town05 Long benchmark.
arXiv Detail & Related papers (2024-06-25T16:12:52Z)
- Enhancing End-to-End Autonomous Driving with Latent World Model [78.22157677787239]
We propose a novel self-supervised learning approach using the LAtent World model (LAW) for end-to-end driving.
LAW predicts future scene features based on current features and ego trajectories.
This self-supervised task can be seamlessly integrated into perception-free and perception-based frameworks.
arXiv Detail & Related papers (2024-06-12T17:59:21Z)
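LAW's self-supervised objective, as summarized above, is to predict future scene features from the current features and the ego trajectory. The sketch below shows that idea in its simplest form; the MLP predictor, feature dimensions, and detached-target MSE loss are assumptions for illustration rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Sketch: predict next-frame scene features from current features + ego trajectory."""

    def __init__(self, feat_dim: int = 256, traj_dim: int = 12):
        super().__init__()
        self.predictor = nn.Sequential(
            nn.Linear(feat_dim + traj_dim, 512), nn.ReLU(),
            nn.Linear(512, feat_dim),
        )

    def forward(self, curr_feat: torch.Tensor, ego_traj: torch.Tensor) -> torch.Tensor:
        # curr_feat: (B, feat_dim) scene feature at time t
        # ego_traj:  (B, traj_dim) flattened ego trajectory, e.g. 6 future (x, y) offsets
        return self.predictor(torch.cat([curr_feat, ego_traj], dim=-1))

# Self-supervised objective: match the encoder's features at t+1 (no labels needed).
world_model = LatentWorldModel()
curr_feat, next_feat = torch.randn(4, 256), torch.randn(4, 256)  # stand-ins for encoder outputs
ego_traj = torch.randn(4, 12)
loss = nn.functional.mse_loss(world_model(curr_feat, ego_traj), next_feat.detach())
loss.backward()
```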
- DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving [67.46481099962088]
Current vision-centric pre-training typically relies on either 2D or 3D pretext tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task.
We introduce DriveWorld, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion.
DriveWorld delivers promising results on various autonomous driving tasks.
arXiv Detail & Related papers (2024-05-07T15:14:20Z)
- Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases [102.05741859030951]
We propose CODA-LM, the first benchmark for the automatic evaluation of LVLMs for self-driving corner cases.
We show that using text-only large language models as judges reveals even better alignment with human preferences than LVLM judges.
Our CODA-VLM performs comparably with GPT-4V, even surpassing GPT-4V by +21.42% on the regional perception task.
arXiv Detail & Related papers (2024-04-16T14:20:55Z)
- VLP: Vision Language Planning for Autonomous Driving [52.640371249017335]
This paper presents a novel Vision-Language-Planning framework that exploits language models to bridge the gap between linguistic understanding and autonomous driving.
It achieves state-of-the-art end-to-end planning performance on the nuScenes dataset, reducing the average L2 error and collision rate by 35.9% and 60.5%, respectively.
arXiv Detail & Related papers (2024-01-10T23:00:40Z)
- DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving [69.82743399946371]
DriveMLM is a framework that can perform closed-loop autonomous driving in realistic simulators.
We employ a multi-modal LLM (MLLM) to model the behavior planning module of a modular AD system.
This model can be plugged into existing AD systems such as Apollo for closed-loop driving.
arXiv Detail & Related papers (2023-12-14T18:59:05Z)
- Looking Inside Out: Anticipating Driver Intent From Videos [20.501288763809036]
Driver intention can be leveraged to improve road safety, such as warning surrounding vehicles in the event the driver is attempting a dangerous maneuver.
We propose a novel method of utilizing in-cabin and external camera data to improve state-of-the-art (SOTA) performance in predicting future driver actions.
Our models predict driver maneuvers more accurately and earlier than existing approaches, with an accuracy of 87.5% and an average prediction time of 4.35 seconds before the maneuver takes place.
arXiv Detail & Related papers (2023-12-03T16:24:50Z)
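The driver-intent paper above fuses in-cabin and external camera data to anticipate maneuvers ahead of time. The following two-stream fusion classifier is only a schematic sketch of that setup; the feature extractors, dimensions, and maneuver classes are placeholders, not the authors' model.

```python
import torch
import torch.nn as nn

class IntentFusionClassifier(nn.Module):
    """Sketch: fuse in-cabin and external video features to predict the next maneuver."""

    MANEUVERS = ["keep_lane", "left_lane_change", "right_lane_change", "left_turn", "right_turn"]

    def __init__(self, cabin_dim: int = 512, external_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(cabin_dim + external_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, len(self.MANEUVERS)),
        )

    def forward(self, cabin_feat: torch.Tensor, external_feat: torch.Tensor) -> torch.Tensor:
        # Each input is a clip-level feature from some video backbone (assumed, not specified here).
        return self.fusion(torch.cat([cabin_feat, external_feat], dim=-1))

model = IntentFusionClassifier()
logits = model(torch.randn(2, 512), torch.randn(2, 512))
pred = logits.argmax(dim=-1)  # predicted maneuver index, ideally seconds before it happens
```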
- Policy Pre-training for End-to-end Autonomous Driving via Self-supervised Geometric Modeling [96.31941517446859]
We propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward, fully self-supervised framework for policy pretraining in visuomotor driving.
We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos.
In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input.
In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only.
arXiv Detail & Related papers (2023-01-03T08:52:49Z)
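PPGeo's second stage, described above, trains the visual encoder by predicting ego-motion and minimizing a photometric error against the current observation. The function below sketches a generic monodepth-style photometric loss of that kind, simplified to a single scale with no occlusion handling or SSIM term; all shapes and names are assumptions rather than PPGeo's implementation.

```python
import torch
import torch.nn.functional as F

def photometric_loss(target, source, depth, pose, K):
    """Warp `source` into the `target` view with predicted depth and ego-motion,
    then compare pixels (L1). Shapes: target/source (B,3,H,W), depth (B,1,H,W),
    pose (B,4,4) mapping target-frame points into the source frame, K (B,3,3)."""
    B, _, H, W = target.shape
    # Homogeneous pixel grid of the target image.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()   # (3, H, W)
    pix = pix.reshape(1, 3, -1).expand(B, -1, -1)                     # (B, 3, H*W)
    # Back-project to 3D, move into the source camera frame, re-project.
    cam = torch.inverse(K) @ pix * depth.reshape(B, 1, -1)            # (B, 3, H*W)
    cam = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)            # homogeneous
    src = (pose @ cam)[:, :3]
    src = K @ src
    u = src[:, 0] / src[:, 2].clamp(min=1e-6)
    v = src[:, 1] / src[:, 2].clamp(min=1e-6)
    # Normalised sampling grid for grid_sample, then L1 photometric error.
    grid = torch.stack([u / (W - 1) * 2 - 1, v / (H - 1) * 2 - 1], dim=-1).view(B, H, W, 2)
    warped = F.grid_sample(source, grid, align_corners=True)
    return (warped - target).abs().mean()

# Toy usage with identity pose and unit depth (the warped image equals the source).
B, H, W = 1, 48, 64
K = torch.tensor([[[50.0, 0.0, W / 2], [0.0, 50.0, H / 2], [0.0, 0.0, 1.0]]])
loss = photometric_loss(torch.rand(B, 3, H, W), torch.rand(B, 3, H, W),
                        torch.ones(B, 1, H, W), torch.eye(4).unsqueeze(0), K)
```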
- Real Time Monocular Vehicle Velocity Estimation using Synthetic Data [78.85123603488664]
We look at the problem of estimating the velocity of road vehicles from a camera mounted on a moving car.
We propose a two-step approach where first an off-the-shelf tracker is used to extract vehicle bounding boxes and then a small neural network is used to regress the vehicle velocity.
arXiv Detail & Related papers (2021-09-16T13:10:27Z)
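The two-step velocity-estimation recipe above (an off-the-shelf tracker for bounding boxes, then a small network regressing velocity) could look roughly like this sketch; the box parameterization, sequence length, and network size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VelocityRegressor(nn.Module):
    """Sketch: regress a vehicle's velocity from a short sequence of tracked boxes."""

    def __init__(self, seq_len: int = 5):
        super().__init__()
        # Each box: (cx, cy, w, h) in normalized image coordinates, for seq_len frames.
        self.mlp = nn.Sequential(
            nn.Linear(seq_len * 4, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 2),   # (longitudinal, lateral) velocity in m/s
        )

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (B, seq_len, 4) from an off-the-shelf tracker (e.g. a SORT-style tracker).
        return self.mlp(boxes.flatten(start_dim=1))

# Step 1 (assumed): run any detector + tracker to get per-vehicle box sequences.
# Step 2: feed each box track into the small regressor.
model = VelocityRegressor()
velocity = model(torch.rand(8, 5, 4))   # 8 tracked vehicles, 5 frames each
```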
- YOLOP: You Only Look Once for Panoptic Driving Perception [21.802146960999394]
We present a panoptic driving perception network (YOLOP) to perform traffic object detection, drivable area segmentation and lane detection simultaneously.
It is composed of one encoder for feature extraction and three decoders to handle the specific tasks.
Our model performs extremely well on the challenging BDD100K dataset, achieving state-of-the-art on all three tasks in terms of accuracy and speed.
arXiv Detail & Related papers (2021-08-25T14:19:42Z)
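YOLOP's layout, one shared encoder feeding three task-specific decoders, can be summarized with the schematic below; the placeholder convolution stack and channel counts stand in for the real backbone and heads, so this is a structural sketch rather than the released architecture.

```python
import torch
import torch.nn as nn

class PanopticDrivingNet(nn.Module):
    """Schematic sketch of a YOLOP-style layout: shared encoder + three decoders."""

    def __init__(self, num_det_outputs: int = 85):
        super().__init__()
        # Shared feature encoder (placeholder conv stack, not the real backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Three task heads operating on the shared features.
        self.detect_head = nn.Conv2d(64, num_det_outputs, 1)   # traffic object detection
        self.drivable_head = nn.Conv2d(64, 2, 1)               # drivable-area segmentation
        self.lane_head = nn.Conv2d(64, 2, 1)                   # lane-line segmentation

    def forward(self, img: torch.Tensor):
        feat = self.encoder(img)
        return self.detect_head(feat), self.drivable_head(feat), self.lane_head(feat)

net = PanopticDrivingNet()
det, drivable, lanes = net(torch.randn(1, 3, 256, 256))  # one forward pass serves all three tasks
```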
This list is automatically generated from the titles and abstracts of the papers on this site.