LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving
- URL: http://arxiv.org/abs/2603.01928v1
- Date: Mon, 02 Mar 2026 14:42:36 GMT
- Title: LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving
- Authors: Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, Hangjun Ye, Zhi-Xin Yang, Fuxi Wen
- Abstract summary: Vision-Language-Action (VLA) models have revolutionized autonomous driving by unifying perception and planning. Their reliance on explicit textual Chain-of-Thought (CoT) leads to semantic-perceptual decoupling and perceptual-symbolic conflicts. Recent shifts toward latent reasoning attempt to bypass these bottlenecks by thinking in continuous hidden space.
- Score: 21.38662345656532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Vision-Language-Action (VLA) models have revolutionized autonomous driving by unifying perception and planning, their reliance on explicit textual Chain-of-Thought (CoT) leads to semantic-perceptual decoupling and perceptual-symbolic conflicts. Recent shifts toward latent reasoning attempt to bypass these bottlenecks by thinking in continuous hidden space. However, without explicit intermediate constraints, standard latent CoT often operates as a physics-agnostic representation. To address this, we propose the Latent Spatio-Temporal VLA (LaST-VLA), a framework shifting the reasoning paradigm from discrete symbolic processing into a physically grounded Latent Spatio-Temporal CoT. By implementing a dual-feature alignment mechanism, we distill geometric constraints from 3D foundation models and dynamic foresight from world models directly into the latent space. This is coupled with a progressive SFT training strategy that transitions from feature alignment to trajectory generation, and refined via Reinforcement Learning with Group Relative Policy Optimization (GRPO) to ensure safety and rule compliance. LaST-VLA sets a new record on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS), while excelling in spatial-temporal reasoning on the SURDS and NuDynamics benchmarks. A toy sketch of the dual-feature alignment idea follows below.
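To make the dual-feature alignment concrete, here is a minimal PyTorch sketch of how latent CoT tokens could be regressed toward features from a frozen 3D foundation model and a frozen world model. The module name, projection heads, cosine objective, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of dual-feature alignment: latent CoT tokens are pulled
# toward (i) geometric features from a frozen 3D foundation model and
# (ii) foresight features from a frozen world model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualFeatureAlignment(nn.Module):
    def __init__(self, d_latent=1024, d_geo=768, d_dyn=512):
        super().__init__()
        # Separate projection heads map latent CoT tokens into each teacher space.
        self.geo_proj = nn.Linear(d_latent, d_geo)
        self.dyn_proj = nn.Linear(d_latent, d_dyn)

    def forward(self, latent_cot, geo_target, dyn_target):
        # latent_cot:  (B, T, d_latent) latent chain-of-thought tokens
        # geo_target:  (B, T, d_geo)    frozen 3D foundation-model features
        # dyn_target:  (B, T, d_dyn)    frozen world-model foresight features
        geo_loss = 1 - F.cosine_similarity(self.geo_proj(latent_cot), geo_target, dim=-1).mean()
        dyn_loss = 1 - F.cosine_similarity(self.dyn_proj(latent_cot), dyn_target, dim=-1).mean()
        return geo_loss + dyn_loss

align = DualFeatureAlignment()
loss = align(torch.randn(2, 8, 1024), torch.randn(2, 8, 768), torch.randn(2, 8, 512))
loss.backward()  # alignment gradients flow only into the student's latent path
```

In the paper's progressive SFT schedule, a loss of this kind would dominate early training before the objective shifts toward trajectory generation.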
Related papers
- VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction [0.0]
VLMFusionOcc3D is a robust multimodal framework for dense 3D semantic occupancy prediction in autonomous driving. We introduce Weather-Aware Adaptive Fusion, a dynamic gating mechanism that utilizes vehicle metadata and weather-conditioned prompts to re-weight sensor contributions. Our approach achieves significant improvements in challenging weather scenarios, offering a scalable and robust solution for complex urban navigation (a toy gating sketch follows this entry).
arXiv Detail & Related papers (2026-03-03T05:22:28Z)
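A minimal sketch of what a weather-aware gating fusion could look like: a conditioning vector (vehicle metadata plus a weather-prompt embedding) produces softmax weights that re-weight per-sensor BEV features. The two-sensor setup, feature sizes, and the softmax gate are assumptions for illustration, not the paper's architecture.

```python
# Toy weather-conditioned gating fusion over per-sensor BEV feature maps.
import torch
import torch.nn as nn

class WeatherAwareFusion(nn.Module):
    def __init__(self, d_feat=256, d_cond=64, n_sensors=2):
        super().__init__()
        # A small MLP turns the metadata/weather embedding into per-sensor weights.
        self.gate = nn.Sequential(
            nn.Linear(d_cond, 128), nn.ReLU(), nn.Linear(128, n_sensors)
        )

    def forward(self, sensor_feats, cond):
        # sensor_feats: list of (B, d_feat, H, W) maps, e.g. [camera, lidar]
        # cond:         (B, d_cond) embedding of vehicle metadata + weather prompt
        w = torch.softmax(self.gate(cond), dim=-1)           # (B, n_sensors)
        stacked = torch.stack(sensor_feats, dim=1)           # (B, n, d, H, W)
        return (w[:, :, None, None, None] * stacked).sum(1)  # re-weighted fusion

fusion = WeatherAwareFusion()
cam, lidar = torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32)
fused = fusion([cam, lidar], torch.randn(2, 64))  # (2, 256, 32, 32)
```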
- OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL [63.388513841293616]
Existing forgery detection methods fail to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. We propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding.
arXiv Detail & Related papers (2026-02-11T09:41:36Z)
- HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving [20.266736153749417]
Vision-Language-Action (VLA) models offer promising capabilities for autonomous driving through multimodal understanding. Their utilization in safety-critical scenarios is constrained by inherent limitations, including limited numerical reasoning, weak 3D spatial awareness, and high sensitivity to context. We propose HiST-VLA, a novel Hierarchical Spatio-Temporal VLA model designed for reliable trajectory generation.
arXiv Detail & Related papers (2026-02-11T07:08:33Z)
- RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning [61.84363374647606]
Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. These descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. We propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding.
arXiv Detail & Related papers (2026-01-29T12:35:57Z)
- SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving [52.02379432801349]
We propose SGDrive, a novel framework that structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition.
arXiv Detail & Related papers (2026-01-09T08:55:42Z)
- ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving [44.008287454538596]
Vision-language models (VLMs) enrich the end-to-end driving paradigm by introducing cross-modal priors and commonsense reasoning. Current VLM-based planners face three key challenges: (i) a mismatch between discrete text reasoning and continuous control, (ii) high latency from autoregressive chain-of-thought decoding, and (iii) inefficient or non-causal planners that limit real-time deployment. We propose ColaVLA, a unified vision-language-action framework that transfers reasoning from text to a unified latent space and couples it with a hierarchical, parallel trajectory decoder (a decoder sketch follows this entry).
arXiv Detail & Related papers (2025-12-28T14:06:37Z)
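The latency argument behind parallel decoding can be illustrated with a non-autoregressive decoder: learned waypoint queries cross-attend to latent reasoning tokens and emit the whole trajectory in one forward pass, instead of decoding chain-of-thought token by token. The query/cross-attention design below is an assumption, not ColaVLA's architecture.

```python
# Toy non-autoregressive trajectory decoder over latent reasoning tokens.
import torch
import torch.nn as nn

class ParallelTrajectoryDecoder(nn.Module):
    def __init__(self, d_model=512, n_waypoints=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_waypoints, d_model))
        self.attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.head = nn.Linear(d_model, 2)  # (x, y) per waypoint

    def forward(self, latent_reasoning):
        # latent_reasoning: (B, T, d_model) tokens from the latent-space reasoner
        q = self.queries.unsqueeze(0).expand(latent_reasoning.size(0), -1, -1)
        out, _ = self.attn(q, latent_reasoning, latent_reasoning)
        return self.head(out)  # (B, n_waypoints, 2): whole trajectory at once

decoder = ParallelTrajectoryDecoder()
traj = decoder(torch.randn(2, 16, 512))  # one forward pass, no decoding loop
```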
- TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking [30.955088934475928]
We present TrackVLA++, a novel model that enhances embodied visual tracking with two key modules: a spatial reasoning mechanism and a Temporal Identification Memory (TIM). TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings (a toy memory sketch follows this entry).
arXiv Detail & Related papers (2025-10-08T15:29:17Z)
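A hedged sketch of what a temporal identification memory could amount to: a confidence-gated exponential moving average of target embeddings, matched against new detections by cosine similarity. The gating rule, momentum, and dimensions are assumptions, not TrackVLA++'s implementation.

```python
# Toy target-identity memory: confident observations refresh a running
# prototype; candidates are matched against it by cosine similarity.
import torch
import torch.nn.functional as F

class TemporalIDMemory:
    def __init__(self, d=256, momentum=0.9, conf_thresh=0.5):
        self.proto = torch.zeros(d)  # running identity prototype
        self.momentum = momentum
        self.conf_thresh = conf_thresh

    def update(self, feat, conf):
        # Only confident identifications refresh the memory, guarding against
        # drift when the target is occluded or misdetected.
        if conf >= self.conf_thresh:
            self.proto = self.momentum * self.proto + (1 - self.momentum) * feat

    def match(self, candidates):
        # candidates: (N, d) detection embeddings; returns best-match index.
        sims = F.cosine_similarity(candidates, self.proto.unsqueeze(0), dim=-1)
        return sims.argmax().item(), sims

mem = TemporalIDMemory()
mem.update(torch.randn(256), conf=0.9)
idx, sims = mem.match(torch.randn(5, 256))
```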
- Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving [55.13109926181247]
We introduce ReflectDrive, a learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion. Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient computation. Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors (a toy reflection loop is sketched below).
arXiv Detail & Related papers (2025-09-24T13:35:15Z)
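Gradient-free self-correction can be illustrated with a toy reflection loop: sample a candidate trajectory, test it against a safety check, and re-sample until it passes. The checker and the simple goal-conditioned sampler below are stand-ins for the paper's discrete-diffusion machinery, not its implementation.

```python
# Toy safety-aware reflection loop: reject-and-resample, no gradients needed.
import torch

def violates_safety(traj, obstacles, min_dist=1.0):
    # traj: (T, 2) waypoints; obstacles: (N, 2) points. True if any waypoint
    # comes closer than min_dist to an obstacle.
    d = torch.cdist(traj, obstacles)
    return bool((d < min_dist).any())

def reflect_and_correct(sample_fn, obstacles, max_iters=10):
    traj = sample_fn()
    for _ in range(max_iters):          # iterative reflection loop
        if not violates_safety(traj, obstacles):
            return traj                 # accepted: trajectory passes the check
        traj = sample_fn()              # gradient-free correction: re-sample
    return traj                         # fall back to the last candidate

goal = torch.tensor([10.0, 0.0])
# Goal-conditioned sampler: noisy straight-line proposals toward the goal.
sample = lambda: torch.linspace(0, 1, 8).unsqueeze(1) * goal + torch.randn(8, 2)
safe_traj = reflect_and_correct(sample, obstacles=torch.tensor([[5.0, 0.2]]))
```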
- FlowDrive: Energy Flow Field for End-to-End Autonomous Driving [50.89871153094958]
FlowDrive is a novel framework that introduces physically interpretable energy-based flow fields to encode semantic priors and safety cues into the BEV space. Experiments on the NAVSIM v2 benchmark demonstrate that FlowDrive achieves state-of-the-art performance with an EPDMS of 86.3, surpassing prior baselines in both safety and planning quality (a toy energy-field sketch follows this entry).
arXiv Detail & Related papers (2025-09-17T13:51:33Z)
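To illustrate the energy-flow-field idea, here is a toy BEV energy composed of goal attraction and obstacle repulsion, with the flow taken as the negative finite-difference gradient. The potential shapes and coefficients are assumptions; FlowDrive's fields are learned, not hand-written like this.

```python
# Toy BEV energy field: quadratic pull toward the goal plus inverse-distance
# repulsion from obstacles; flow is the negative spatial gradient.
import torch

def bev_energy(grid_xy, goal, obstacles, rep_scale=4.0):
    # grid_xy: (H, W, 2) BEV coordinates; returns (H, W) scalar energy.
    attract = ((grid_xy - goal) ** 2).sum(-1)               # pull toward goal
    dists = torch.cdist(grid_xy.reshape(-1, 2), obstacles)  # (H*W, N)
    repel = (rep_scale / (dists + 0.5)).sum(-1).reshape(grid_xy.shape[:2])
    return attract + repel

xs = torch.linspace(-10, 10, 64)
grid = torch.stack(torch.meshgrid(xs, xs, indexing="xy"), dim=-1)
energy = bev_energy(grid, goal=torch.tensor([8.0, 0.0]),
                    obstacles=torch.tensor([[2.0, 1.0], [4.0, -2.0]]))
# Flow = negative central-difference gradient of the energy field.
flow_x = -(energy[:, 2:] - energy[:, :-2]) / (xs[2] - xs[0])
flow_y = -(energy[2:, :] - energy[:-2, :]) / (xs[2] - xs[0])
```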
- ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving [49.07731497951963]
ReCogDrive is a novel Reinforced Cognitive framework for end-to-end autonomous driving. We introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers. We then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner (a conditioning sketch follows this entry).
arXiv Detail & Related papers (2025-06-09T03:14:04Z)
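The "inject VLM priors into a diffusion planner" step can be sketched as a denoiser conditioned on pooled VLM hidden states alongside the diffusion timestep, so language-derived driving priors steer trajectory denoising. The mean-pooling and additive conditioning below are assumptions for illustration, not ReCogDrive's design.

```python
# Toy diffusion denoiser conditioned on VLM hidden states and the timestep.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    def __init__(self, d_traj=16, d_vlm=1024, d_hidden=256):
        super().__init__()
        self.cond_proj = nn.Linear(d_vlm, d_hidden)   # VLM prior -> conditioning
        self.time_emb = nn.Embedding(1000, d_hidden)  # diffusion timestep
        self.net = nn.Sequential(
            nn.Linear(d_traj + d_hidden, d_hidden), nn.SiLU(),
            nn.Linear(d_hidden, d_traj),
        )

    def forward(self, noisy_traj, t, vlm_hidden):
        # noisy_traj: (B, d_traj) flattened waypoints; vlm_hidden: (B, L, d_vlm)
        cond = self.cond_proj(vlm_hidden.mean(dim=1)) + self.time_emb(t)
        return self.net(torch.cat([noisy_traj, cond], dim=-1))  # predict noise

denoiser = ConditionedDenoiser()
eps = denoiser(torch.randn(2, 16), torch.tensor([10, 500]), torch.randn(2, 32, 1024))
```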
- FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving [19.81442567260658]
We propose a visual spatio-temporal CoT framework that enables VLAs to think in images. On nuScenes and NAVSIM, FSDrive improves accuracy and reduces collisions (a two-stage sketch follows).
arXiv Detail & Related papers (2025-05-23T09:55:32Z)
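A minimal sketch of "thinking in images": the model first predicts discrete tokens for an imagined future frame, then conditions its action head on them, using the imagined frame as a visual chain-of-thought. The two-stage layout, token vocabulary, and pooling are assumptions, not FSDrive's implementation.

```python
# Toy two-stage planner: imagine future-frame tokens, then plan from them.
import torch
import torch.nn as nn

class VisualCoTPlanner(nn.Module):
    def __init__(self, d=512, vocab=8192, n_waypoints=8):
        super().__init__()
        self.future_head = nn.Linear(d, vocab)  # predicts future-frame tokens
        self.tok_emb = nn.Embedding(vocab, d)
        self.action_head = nn.Linear(d, n_waypoints * 2)

    def forward(self, obs_feat):
        # obs_feat: (B, n_img_tokens, d) current observation features.
        # Stage 1: imagine the future frame as discrete visual tokens
        # (argmax for an inference-style sketch; training would keep logits).
        future_tokens = self.future_head(obs_feat).argmax(-1)  # (B, n_img_tokens)
        # Stage 2: plan conditioned on the imagined future (visual CoT).
        ctx = self.tok_emb(future_tokens).mean(1)              # (B, d)
        return self.action_head(ctx).view(-1, 8, 2)            # (B, 8, 2)

planner = VisualCoTPlanner()
traj = planner(torch.randn(2, 64, 512))
```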