Related papers: SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment

SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment

URL: http://arxiv.org/abs/2503.09594v1
Date: Wed, 12 Mar 2025 17:58:06 GMT
Title: SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
Authors: Katrin Renz, Long Chen, Elahe Arani, Oleg Sinavski,
Abstract summary: We propose a model that can handle three different tasks: (1) closed-loop driving, (2) vision-language understanding, and (3) language-action alignment.<n>Our model SimLingo is based on a vision language model (VLM) and works using only camera, excluding expensive sensors like LiDAR.
Score: 15.223886922912842
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Integrating large language models (LLMs) into autonomous driving has attracted significant attention with the hope of improving generalization and explainability. However, existing methods often focus on either driving or vision-language understanding but achieving both high driving performance and extensive language understanding remains challenging. In addition, the dominant approach to tackle vision-language understanding is using visual question answering. However, for autonomous driving, this is only useful if it is aligned with the action space. Otherwise, the model's answers could be inconsistent with its behavior. Therefore, we propose a model that can handle three different tasks: (1) closed-loop driving, (2) vision-language understanding, and (3) language-action alignment. Our model SimLingo is based on a vision language model (VLM) and works using only camera, excluding expensive sensors like LiDAR. SimLingo obtains state-of-the-art performance on the widely used CARLA simulator on the Bench2Drive benchmark and is the winning entry at the CARLA challenge 2024. Additionally, we achieve strong results in a wide variety of language-related tasks while maintaining high driving performance.

Related papers

MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding [7.093473654069259]
We propose a novel Multimodal Causal Analysis Model (MCAM) that constructs latent causal structures between visual and language modalities.<n>Experiments on the BDD-X, and CoVLA datasets demonstrate that MCAM achieves SOTA performance in visual-language causal relationship learning.<n>The model exhibits superior capability in capturing causal characteristics within video sequences, showcasing its effectiveness for autonomous driving applications.
arXiv Detail & Related papers (2025-07-08T15:14:53Z)
A Survey on Vision-Language-Action Models for Autonomous Driving [26.407082158880204]
Vision-Language-Action (VLA) paradigms integrate visual perception, natural language understanding, and control within a single policy.<n>Researchers in autonomous driving are actively adapting these methods to the vehicle domain.<n>This survey offers the first comprehensive overview of VLA for Autonomous Driving.
arXiv Detail & Related papers (2025-06-30T16:50:02Z)
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning [68.45848423501927]
We propose a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. Our approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions.
arXiv Detail & Related papers (2025-04-06T03:54:21Z)
Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning [24.511628941825116]
We introduce Sce2DriveX, a human-like driving chain-of-thought (CoT) reasoning framework framework.<n>It reconstructs the implicit cognitive chain inherent in human driving, covering scene understanding, meta-action reasoning, behavior interpretation analysis, motion planning and control.<n>It achieves state-of-the-art performance from scene understanding to end-to-end driving, as well as robust generalization on the CARLA Bench2Drive benchmark.
arXiv Detail & Related papers (2025-02-19T09:50:44Z)
Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving [20.33096710167997]
generative planning with 3D-vision language pre-training model named GPVL is proposed for end-to-end autonomous driving.<n>Cross-modal language model is introduced to generate holistic driving decisions and fine-grained trajectories.<n>It is believed that the effective, robust and efficient performance of GPVL is crucial for the practical application of future autonomous driving systems.
arXiv Detail & Related papers (2025-01-15T15:20:46Z)
MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving [10.74799483937468]
Vision-language models (VLMs) serve as general-purpose end-to-end models in autonomous driving. Most existing methods rely on computationally expensive visual encoders and large language models (LLMs) We propose a novel framework called MiniDrive, which incorporates our proposed Feature Engineering Mixture of Experts (FE-MoE) module and Dynamic Instruction Adapter (DI-Adapter)
arXiv Detail & Related papers (2024-09-11T13:43:01Z)
OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving [12.004183122121042]
OccLLaMA is an occupancy-language-action generative world model. We build a unified multi-modal vocabulary for vision, language and action. OccLLaMA achieves competitive performance across multiple tasks.
arXiv Detail & Related papers (2024-09-05T06:30:01Z)
Probing Multimodal LLMs as World Models for Driving [72.18727651074563]
We look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving. Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored.
arXiv Detail & Related papers (2024-05-09T17:52:42Z)
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning [68.45848423501927]
We propose a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. Our approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions.
arXiv Detail & Related papers (2024-05-02T17:59:24Z)
Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving [0.0]
We develop an efficient, lightweight, multi-frame vision language model which performs Visual Question Answering for autonomous driving. In comparison to previous approaches, EM-VLM4AD requires at least 10 times less memory and floating point operations.
arXiv Detail & Related papers (2024-03-28T21:18:33Z)
VLP: Vision Language Planning for Autonomous Driving [52.640371249017335]
This paper presents a novel Vision-Language-Planning framework that exploits language models to bridge the gap between linguistic understanding and autonomous driving. It achieves state-of-the-art end-to-end planning performance on the NuScenes dataset by achieving 35.9% and 60.5% reduction in terms of average L2 error and collision rates, respectively.
arXiv Detail & Related papers (2024-01-10T23:00:40Z)
Personalized Autonomous Driving with Large Language Models: Field Experiments [11.429053835807697]
We introduce an LLM-based framework, Talk2Drive, capable of translating natural verbal commands into executable controls. This is the first-of-its-kind multi-scenario field experiment that deploys LLMs on a real-world autonomous vehicle. We validate that the proposed memory module considers personalized preferences and further reduces the takeover rate by up to 65.2%.
arXiv Detail & Related papers (2023-12-14T23:23:37Z)
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model [84.29836263441136]
This study introduces DriveGPT4, a novel interpretable end-to-end autonomous driving system based on multimodal large language models (MLLMs) DriveGPT4 facilitates the interpretation of vehicle actions, offers pertinent reasoning, and effectively addresses a diverse range of questions posed by users.
arXiv Detail & Related papers (2023-10-02T17:59:52Z)
ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language model with noise input. We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
Connecting Language and Vision for Natural Language-Based Vehicle Retrieval [77.88818029640977]
In this paper, we apply one new modality, i.e., the language description, to search the vehicle of interest. To connect language and vision, we propose to jointly train the state-of-the-art vision models with the transformer-based language model. Our proposed method has achieved the 1st place on the 5th AI City Challenge, yielding competitive performance 18.69% MRR accuracy.
arXiv Detail & Related papers (2021-05-31T11:42:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.