FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
- URL: http://arxiv.org/abs/2505.17685v2
- Date: Wed, 29 Oct 2025 12:46:23 GMT
- Title: FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
- Authors: Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei
- Abstract summary: We propose FSDrive, a visual spatio-temporal CoT framework that enables VLAs to think in images. On nuScenes and NAVSIM, FSDrive improves trajectory accuracy and reduces collisions.
- Score: 19.81442567260658
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language-Action (VLA) models are increasingly used for end-to-end driving due to their world knowledge and reasoning ability. Most prior work, however, inserts textual chains-of-thought (CoT) as intermediate steps tailored to the current scene. Such symbolic compressions can blur spatio-temporal relations and discard fine visual cues, creating a cross-modal gap between perception and planning. We propose FSDrive, a visual spatio-temporal CoT framework that enables VLAs to think in images. The model first acts as a world model to generate a unified future frame that overlays coarse but physically plausible priors (future lane dividers and 3D boxes) on the predicted future image. This unified frame serves as the visual CoT, capturing both spatial structure and temporal evolution. The same VLA then functions as an inverse-dynamics model, planning trajectories from current observations and the visual CoT. To equip VLAs with image generation while preserving understanding, we introduce a unified pre-training paradigm that expands the vocabulary to include visual tokens and jointly optimizes VQA (for semantics) and future-frame prediction (for dynamics). A progressive easy-to-hard scheme first predicts lane/box priors to enforce physical constraints, then completes full future frames for fine details. On nuScenes and NAVSIM, FSDrive improves trajectory accuracy and reduces collisions under both ST-P3 and UniAD metrics, and attains competitive FID for future-frame generation despite using lightweight autoregression. It also advances scene understanding on DriveLM. Together, these results indicate that visual CoT narrows the cross-modal gap and yields safer, more anticipatory planning. Code is available at https://github.com/MIV-XJTU/FSDrive.
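To make the two-stage scheme concrete, here is a minimal sketch of how a single autoregressive VLA might first emit the visual CoT and then plan from it. Every interface below (the tokenizer, `vla.generate`, the token budgets) is a hypothetical stand-in for illustration, not the released FSDrive API:

```python
# Hedged sketch of the two-stage inference the abstract describes. All
# interfaces here (encode_images, generate, tokens_per_frame, ...) are
# hypothetical stand-ins, not the released FSDrive API.
import torch

def plan_with_visual_cot(vla, tokenizer, detokenizer, frames, prompt_text):
    """frames: current camera images; returns planned waypoints."""
    prompt = tokenizer.encode_text(prompt_text)
    obs = tokenizer.encode_images(frames)  # visual tokens from the expanded vocabulary

    # Stage 1, world model: autoregressively emit the unified future frame,
    # i.e., lane-divider and 3D-box priors overlaid on the predicted image.
    cot = vla.generate(torch.cat([prompt, obs], dim=-1),
                       max_new_tokens=detokenizer.tokens_per_frame)

    # Stage 2, inverse dynamics: plan from current observations plus the
    # visual CoT instead of a textual chain-of-thought.
    traj = vla.generate(torch.cat([prompt, obs, cot], dim=-1),
                        max_new_tokens=vla.trajectory_token_budget)
    return detokenizer.decode_trajectory(traj)
```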
Related papers
- Chain of World: World Model Thinking in Latent Motion [24.24061036481793]
Vision-Language-Action (VLA) models often overlook the predictive and temporal-causal structure underlying visual dynamics. We introduce CoWVLA, a new "Chain of World" paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. CoWVLA outperforms existing world-model and latent-action approaches and achieves moderate computational efficiency.
arXiv Detail & Related papers (2026-03-03T17:52:06Z) - Generative Scenario Rollouts for End-to-End Autonomous Driving [58.99809446189301]
Vision-Language-Action (VLA) models are emerging as highly effective planning models for end-to-end autonomous driving systems. We propose Generative Scenario Rollouts (GeRo), a plug-and-play framework for VLA models that jointly performs planning and generation of language-grounded future traffic scenes.
arXiv Detail & Related papers (2026-01-16T17:59:28Z) - UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving [29.623672055601418]
We propose a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation. Experiments on the Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method.
arXiv Detail & Related papers (2026-01-07T23:49:52Z) - FutureX: Enhance End-to-End Autonomous Driving via Latent Chain-of-Thought World Model [103.2513470454204]
FutureX is a pipeline that enables end-to-end planners to perform complex motion planning via future-scene latent reasoning and trajectory refinement. It enhances existing methods by producing more rational motion plans and fewer collisions without compromising efficiency.
arXiv Detail & Related papers (2025-12-12T02:12:49Z) - Future-Aware End-to-End Driving: Bidirectional Modeling of Trajectory Planning and Scene Evolution [96.25314747309811]
We introduce SeerDrive, a novel end-to-end framework that jointly models future scene evolution and trajectory planning. Our method first predicts future bird's-eye view (BEV) representations to anticipate the dynamics of the surrounding scene. Two key components enable this: (1) future-aware planning, which injects predicted BEV features into the trajectory planner, and (2) iterative scene modeling and vehicle planning.
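As a loose illustration of future-aware planning, the sketch below fuses a predicted future-scene summary with the current BEV feature before decoding waypoints; the GRU stand-in world model and all shapes are assumptions for illustration, not SeerDrive's architecture:

```python
# Hedged sketch of future-aware planning: fuse current and predicted future
# BEV features before decoding waypoints. Names and shapes are illustrative.
import torch
import torch.nn as nn

class FutureAwarePlanner(nn.Module):
    def __init__(self, bev_dim=256, horizon=6):
        super().__init__()
        # Stand-in future-BEV predictor; the real method predicts richer BEV maps.
        self.world_model = nn.GRU(bev_dim, bev_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(2 * bev_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * 2),          # (x, y) per future step
        )
        self.horizon = horizon

    def forward(self, bev_history):
        # bev_history: (B, T, bev_dim) pooled BEV features of past frames.
        _, h = self.world_model(bev_history)      # (1, B, bev_dim) future summary
        fused = torch.cat([bev_history[:, -1], h[0]], dim=-1)
        return self.head(fused).view(-1, self.horizon, 2)
```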
arXiv Detail & Related papers (2025-10-13T07:41:47Z) - FlowDrive: Energy Flow Field for End-to-End Autonomous Driving [50.89871153094958]
FlowDrive is a novel framework that introduces physically interpretable energy-based flow fields to encode semantic priors and safety cues into the BEV space. Experiments on the NAVSIM v2 benchmark demonstrate that FlowDrive achieves state-of-the-art performance with an EPDMS of 86.3, surpassing prior baselines in both safety and planning quality.
arXiv Detail & Related papers (2025-09-17T13:51:33Z) - KEPT: Knowledge-Enhanced Prediction of Trajectories from Consecutive Driving Frames with Vision-Language Models [19.625631486595505]
This paper introduces KEPT, a knowledge-enhanced vision-language framework that predicts ego trajectories directly from consecutive front-view driving frames, achieving state-of-the-art performance across open-loop protocols.
arXiv Detail & Related papers (2025-09-03T03:10:42Z) - FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning [75.80110543049783]
We propose FastDriveVLA, a reconstruction-based vision-token pruning framework for autonomous driving. A novel adversarial foreground-background reconstruction strategy is designed to train ReconPruner for the visual encoder of VLA models. Our approach achieves state-of-the-art results on the nuScenes open-loop planning benchmark across different pruning ratios.
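The general idea of such token pruning, scoring visual tokens and keeping only the most informative ones before they enter the language model, can be sketched as follows; the MLP scorer and keep-ratio are illustrative assumptions, not the ReconPruner design:

```python
# Hedged sketch of vision-token pruning: score each visual token (here with a
# small MLP standing in for a learned pruner) and keep only the top-k before
# the tokens enter the VLA's language model. Illustrative, not FastDriveVLA.
import torch
import torch.nn as nn

def prune_tokens(tokens: torch.Tensor, scorer: nn.Module, keep_ratio: float = 0.5):
    """tokens: (B, N, D) visual tokens; returns (B, k, D) with k = keep_ratio * N."""
    scores = scorer(tokens).squeeze(-1)              # (B, N) "foreground-ness" scores
    k = max(1, int(tokens.size(1) * keep_ratio))
    idx = scores.topk(k, dim=1).indices              # k highest-scoring tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    return tokens.gather(1, idx)

scorer = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1))
kept = prune_tokens(torch.randn(2, 196, 768), scorer, keep_ratio=0.25)  # (2, 49, 768)
```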
arXiv Detail & Related papers (2025-07-31T07:55:56Z) - AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning [42.409352964719204]
Vision-Language-Action (VLA) models have shown promise for end-to-end autonomous driving. Current VLA models struggle with physically infeasible action outputs, complex model structures, or unnecessarily long reasoning. We propose AutoVLA, a novel VLA model that unifies reasoning and action generation within a single autoregressive generation model.
arXiv Detail & Related papers (2025-06-16T17:58:50Z) - DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving [20.197094443215963]
We present DriveX, a self-supervised world model that learns general scene dynamics and holistic representations from driving videos. DriveX introduces Omni Scene Modeling (OSM), a module that unifies multimodal supervision (3D point cloud forecasting, 2D semantic representation, and image generation). For downstream adaptation, we design Future Spatial Attention (FSA), a unified paradigm that dynamically aggregates features from DriveX's predictions to enhance task-specific inference.
arXiv Detail & Related papers (2025-05-25T17:27:59Z) - HAMF: A Hybrid Attention-Mamba Framework for Joint Scene Context Understanding and Future Motion Representation Learning [12.568968115955865]
We propose HAMF, a novel motion forecasting framework that jointly learns future motion representations and scene context encodings. We show that our hybrid Attention-Mamba model achieves state-of-the-art motion forecasting performance with a simple and lightweight architecture.
arXiv Detail & Related papers (2025-05-21T16:16:52Z) - CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We incorporate explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs) and introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z) - A Survey of World Models for Autonomous Driving [63.33363128964687]
Recent breakthroughs in autonomous driving have been propelled by advances in robust world modeling. World models offer high-fidelity representations of the driving environment that integrate multi-sensor data, semantic cues, and temporal dynamics. This paper systematically reviews recent advances in world models for autonomous driving.
arXiv Detail & Related papers (2025-01-20T04:00:02Z) - Multi-scale Temporal Fusion Transformer for Incomplete Vehicle Trajectory Prediction [23.72022120344089]
Motion prediction plays an essential role in autonomous driving systems.
We propose a novel end-to-end framework for incomplete vehicle trajectory prediction.
We evaluate the proposed model on four datasets derived from highway and urban traffic scenarios.
arXiv Detail & Related papers (2024-09-02T02:36:18Z) - BEVWorld: A Multimodal World Simulator for Autonomous Driving via Scene-Level BEV Latents [56.33989853438012]
We propose BEVWorld, a framework that transforms multimodal sensor inputs into a unified and compact Bird's Eye View latent space for holistic environment modeling. The proposed world model consists of two main components: a multi-modal tokenizer and a latent BEV sequence diffusion model.
arXiv Detail & Related papers (2024-07-08T07:26:08Z) - HPNet: Dynamic Trajectory Forecasting with Historical Prediction Attention [76.37139809114274]
HPNet is a novel dynamic trajectory forecasting method.
We propose a Historical Prediction Attention module to automatically encode the dynamic relationship between successive predictions.
Our code is available at https://github.com/XiaolongTang23/HPNet.
arXiv Detail & Related papers (2024-04-09T14:42:31Z) - AMP: Autoregressive Motion Prediction Revisited with Next Token Prediction for Autonomous Driving [59.94343412438211]
We introduce GPT-style next-token prediction into motion prediction.
Unlike language data, which is composed of homogeneous units (words), the elements in a driving scene can have complex spatial-temporal and semantic relations.
We adopt three factorized attention modules, each with different neighbors for information aggregation and different positional-encoding styles, to capture these relations.
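A minimal sketch of this kind of factorization, assuming each relation (temporal, spatial, semantic) is expressed as a boolean neighbor mask over one attention module apiece; this is illustrative only, not AMP's implementation:

```python
# Hedged sketch of factorized attention over different neighbor sets: each
# sub-module attends only within one relation via a boolean mask, and the
# three results are summed residually. Not AMP's actual code.
import torch
import torch.nn as nn

class FactorizedAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.semantic = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, temporal_mask, spatial_mask, semantic_mask):
        # Each mask is (L, L) bool; True marks pairs that may NOT attend,
        # so each relation sees a different neighbor set.
        t, _ = self.temporal(x, x, x, attn_mask=temporal_mask)
        s, _ = self.spatial(x, x, x, attn_mask=spatial_mask)
        m, _ = self.semantic(x, x, x, attn_mask=semantic_mask)
        return x + t + s + m

x = torch.randn(2, 10, 64)
mask = torch.zeros(10, 10, dtype=torch.bool)   # here: all pairs allowed
out = FactorizedAttention(64)(x, mask, mask, mask)  # (2, 10, 64)
```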
arXiv Detail & Related papers (2024-03-20T06:22:37Z) - GenAD: Generative End-to-End Autonomous Driving [13.332272121018285]
GenAD is a generative framework that casts autonomous driving into a generative modeling problem.
We propose an instance-centric scene tokenizer that first transforms the surrounding scenes into map-aware instance tokens.
We then employ a variational autoencoder to learn the future trajectory distribution in a structural latent space for trajectory prior modeling.
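A bare-bones version of such a trajectory CVAE prior (trained with a reconstruction plus KL objective) might look like the following; the dimensions and module choices are assumptions for illustration, not GenAD's code:

```python
# Hedged sketch of a conditional VAE over future trajectories: encode
# ground-truth waypoints conditioned on scene tokens into a latent, decode
# waypoints back; at test time sample z from the prior. Not GenAD's code.
import torch
import torch.nn as nn

class TrajCVAE(nn.Module):
    def __init__(self, scene_dim=256, horizon=6, latent_dim=32):
        super().__init__()
        traj_dim = horizon * 2                        # (x, y) per future step
        self.enc = nn.Sequential(
            nn.Linear(scene_dim + traj_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * latent_dim),           # -> (mu, logvar)
        )
        self.dec = nn.Sequential(
            nn.Linear(scene_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, traj_dim),
        )
        self.latent_dim, self.horizon = latent_dim, horizon

    def forward(self, scene, traj):
        # scene: (B, scene_dim); traj: (B, horizon, 2) ground-truth future.
        h = self.enc(torch.cat([scene, traj.flatten(1)], dim=-1))
        mu, logvar = h.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        recon = self.dec(torch.cat([scene, z], dim=-1))
        return recon.view(-1, self.horizon, 2), mu, logvar

    @torch.no_grad()
    def sample(self, scene, k=6):
        # Draw k candidate futures per scene from the standard-normal prior.
        z = torch.randn(scene.size(0), k, self.latent_dim, device=scene.device)
        cond = scene.unsqueeze(1).expand(-1, k, -1)
        out = self.dec(torch.cat([cond, z], dim=-1))
        return out.view(scene.size(0), k, self.horizon, 2)
```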
arXiv Detail & Related papers (2024-02-18T08:21:05Z) - Context-Aware Timewise VAEs for Real-Time Vehicle Trajectory Prediction [4.640835690336652]
We present ContextVAE, a context-aware approach for multi-modal vehicle trajectory prediction.
Our approach takes into account both the social features exhibited by agents in the scene and the physical environment constraints.
In all tested datasets, ContextVAE models are fast to train and provide high-quality multi-modal predictions in real-time.
arXiv Detail & Related papers (2023-02-21T18:42:24Z) - ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning [132.20119288212376]
We propose a spatial-temporal feature learning scheme towards a set of more representative features for perception, prediction and planning tasks simultaneously.
To the best of our knowledge, we are the first to systematically investigate each part of an interpretable end-to-end vision-based autonomous driving system.
arXiv Detail & Related papers (2022-07-15T16:57:43Z) - CARNet: A Dynamic Autoencoder for Learning Latent Dynamics in Autonomous Driving Tasks [11.489187712465325]
An autonomous driving system should effectively use the information collected from the various sensors in order to form an abstract description of the world.
Deep learning models, such as autoencoders, can be used for that purpose, as they can learn compact latent representations from a stream of incoming data.
This work proposes CARNet, a Combined dynAmic autoencodeR NETwork architecture that utilizes an autoencoder combined with a recurrent neural network to learn the current latent representation.
arXiv Detail & Related papers (2022-05-18T04:15:42Z) - Implicit Latent Variable Model for Scene-Consistent Motion Forecasting [78.74510891099395]
In this paper, we aim to learn scene-consistent motion forecasts of complex urban traffic directly from sensor data.
We model the scene as an interaction graph and employ powerful graph neural networks to learn a distributed latent representation of the scene.
arXiv Detail & Related papers (2020-07-23T14:31:25Z)