KEPT: Knowledge-Enhanced Prediction of Trajectories from Consecutive Driving Frames with Vision-Language Models
- URL: http://arxiv.org/abs/2509.02966v1
- Date: Wed, 03 Sep 2025 03:10:42 GMT
- Title: KEPT: Knowledge-Enhanced Prediction of Trajectories from Consecutive Driving Frames with Vision-Language Models
- Authors: Yujin Wang, Tianyi Wang, Quanfeng Liu, Wenxian Fan, Junfeng Jiao, Christian Claudel, Yunbing Yan, Bingzhao Gao, Jianqiang Wang, Hong Chen
- Abstract summary: This paper introduces KEPT, a knowledge-enhanced vision-language framework that predicts ego trajectories directly from consecutive front-view driving frames, achieving state-of-the-art performance across open-loop protocols.
- Score: 19.625631486595505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate short-horizon trajectory prediction is pivotal for safe and reliable autonomous driving, yet existing vision-language models (VLMs) often fail to effectively ground their reasoning in scene dynamics and domain knowledge. To address this challenge, this paper introduces KEPT, a knowledge-enhanced VLM framework that predicts ego trajectories directly from consecutive front-view driving frames. KEPT couples a temporal frequency-spatial fusion (TFSF) video encoder, trained via self-supervised learning with hard-negative mining, with a scalable k-means + HNSW retrieval stack that supplies scene-aligned exemplars. Retrieved priors are embedded into chain-of-thought (CoT) prompts with explicit planning constraints, while a triple-stage fine-tuning schedule incrementally aligns the language head to metric spatial cues, physically feasible motion, and temporally conditioned front-view planning. Evaluated on the nuScenes dataset, KEPT achieves state-of-the-art performance across open-loop protocols: under NoAvg, it achieves 0.70 m average L2 with a 0.21% collision rate; under TemAvg with lightweight ego status, it attains 0.31 m average L2 and a 0.07% collision rate. Ablation studies show that all three fine-tuning stages contribute complementary benefits, and that using Top-2 retrieved exemplars yields the best accuracy-safety trade-off. The k-means-clustered HNSW index delivers sub-millisecond retrieval latency, supporting practical deployment. These results indicate that retrieval-augmented, CoT-guided VLMs offer a promising, data-efficient pathway toward interpretable and trustworthy autonomous driving.
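The k-means + HNSW retrieval stack can be pictured as k-means routing over a bank of scene embeddings with one small HNSW index per cluster. Below is a minimal sketch of that pattern using hnswlib and scikit-learn; the dimensions, cluster count, and helper names are illustrative assumptions, not KEPT's released code.

```python
import numpy as np
import hnswlib
from sklearn.cluster import KMeans

DIM, N_CLUSTERS = 512, 8                    # assumed embedding size / cluster count
bank = np.random.rand(10_000, DIM).astype(np.float32)  # stand-in scene embeddings

# 1) Partition the exemplar bank with k-means.
kmeans = KMeans(n_clusters=N_CLUSTERS, n_init=10).fit(bank)

# 2) Build one small HNSW index per cluster for fast approximate search.
indexes = []
for c in range(N_CLUSTERS):
    ids = np.where(kmeans.labels_ == c)[0]
    idx = hnswlib.Index(space="cosine", dim=DIM)
    idx.init_index(max_elements=len(ids), ef_construction=200, M=16)
    idx.add_items(bank[ids], ids)           # keep global ids for exemplar lookup
    indexes.append(idx)

def retrieve(query: np.ndarray, k: int = 2):
    """Route the query to its nearest centroid, then search only that
    cluster's index (Top-2 matches the paper's best-performing setting)."""
    c = int(kmeans.predict(query[None, :])[0])
    labels, dists = indexes[c].knn_query(query[None, :], k=k)
    return labels[0], dists[0]

exemplar_ids, _ = retrieve(np.random.rand(DIM).astype(np.float32))
# Retrieved exemplars would then be serialized into the CoT prompt as priors.
```

Searching only the routed cluster keeps latency low at the cost of occasional misses near cluster boundaries, a common trade-off for this index layout.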
Related papers
- AutoDrive-R$^2$: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving [37.260140808367716]
We propose AutoDrive-R$^2$, a novel VLA framework that enhances both reasoning and self-reflection capabilities of autonomous driving systems. We first propose an innovative CoT dataset named nuScenesR$^2$-6K for supervised fine-tuning. We then employ the Group Relative Policy Optimization (GRPO) algorithm within a physics-grounded reward framework to ensure reliable smoothness and realistic trajectory planning.
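GRPO scores a group of sampled trajectories with a task reward and normalizes each reward within the group, so no learned critic is needed. A hedged sketch of that advantage computation follows, with a toy smoothness penalty standing in for the paper's physics-grounded reward terms (function names and weights are assumptions):

```python
import numpy as np

def smoothness_reward(traj: np.ndarray) -> float:
    """Toy physics-grounded reward: penalize large accelerations (second
    differences of waypoints). A stand-in for the paper's reward terms."""
    accel = np.diff(traj, n=2, axis=0)               # (T-2, 2) finite differences
    return -float(np.linalg.norm(accel, axis=1).mean())

def grpo_advantages(group: list[np.ndarray]) -> np.ndarray:
    """Group Relative Policy Optimization: normalize each sample's reward
    by the group mean and std instead of using a learned value function."""
    r = np.array([smoothness_reward(t) for t in group])
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: 4 sampled trajectories of 8 (x, y) waypoints each
group = [np.cumsum(np.random.randn(8, 2), axis=0) for _ in range(4)]
print(grpo_advantages(group))
```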
arXiv Detail & Related papers (2025-09-02T04:32:24Z)
- SpaRC-AD: A Baseline for Radar-Camera Fusion in End-to-End Autonomous Driving [5.343552118560704]
SpaRC-AD is a query-based end-to-end camera-radar fusion framework for planning-oriented autonomous driving. Our method achieves strong improvements over the state-of-the-art vision-only baselines across multiple autonomous driving tasks.
arXiv Detail & Related papers (2025-08-14T12:02:41Z)
- ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving [49.07731497951963]
ReCogDrive is a novel Reinforced Cognitive framework for end-to-end autonomous driving. We introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers. We then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner.
arXiv Detail & Related papers (2025-06-09T03:14:04Z)
- DriveMind: A Dual-VLM based Reinforcement Learning Framework for Autonomous Driving [14.988477212106018]
DriveMind is a semantic reward framework for autonomous driving. We show it can achieve 19.4 +/- 2.3 km/h average speed, 0.98 +/- 0.03 route completion, and near-zero collisions. Its semantic reward generalizes zero-shot to real dash-cam data with minimal distributional shift.
arXiv Detail & Related papers (2025-06-01T03:51:09Z)
- SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving [51.47621083057114]
SOLVE is an innovative framework that synergizes vision-language models with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between the VLM and E2E components.
arXiv Detail & Related papers (2025-05-22T15:44:30Z)
- SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining [62.433137130087445]
SuperFlow++ is a novel framework that integrates pretraining and downstream tasks using consecutive camera pairs. We show that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving.
arXiv Detail & Related papers (2025-03-25T17:59:57Z)
- RAC3: Retrieval-Augmented Corner Case Comprehension for Autonomous Driving with Vision-Language Models [9.304973961799359]
Vision-language models (VLMs) play a crucial role in enhancing scenario comprehension, yet they face challenges such as hallucination and insufficient real-world grounding. In this work, RAC3 is proposed to enhance the performance of VLMs in corner-case comprehension.
arXiv Detail & Related papers (2024-12-15T04:51:30Z)
- HE-Drive: Human-Like End-to-End Driving with Vision Language Models [11.845309076856365]
We propose HE-Drive: the first human-like-centric end-to-end autonomous driving system.
We show that HE-Drive achieves state-of-the-art performance (i.e., reducing the average collision rate by 71% compared to VAD) and efficiency (i.e., 1.9X faster than SparseDrive) on benchmark datasets.
arXiv Detail & Related papers (2024-10-07T14:06:16Z)
- DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Self-Driving [55.53171248839489]
We propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving. Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction, and an iterative motion planner. Experiments conducted on the nuScenes and Bench2Drive datasets demonstrate the superior planning performance and great efficiency of DiFSD.
arXiv Detail & Related papers (2024-09-15T15:55:24Z)
- OPUS: Occupancy Prediction Using a Sparse Set [64.60854562502523]
We present a framework to simultaneously predict occupied locations and classes using a set of learnable queries.
OPUS incorporates a suite of non-trivial strategies to enhance model performance.
Our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at near 2x FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.
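A hedged sketch of what such a query-based set prediction head can look like: each learnable query regresses a handful of 3D point locations plus per-point class logits. Sizes and names here are assumptions for illustration, not OPUS's actual architecture.

```python
import torch
import torch.nn as nn

class QueryOccupancyHead(nn.Module):
    """Set-based occupancy prediction sketch: learnable queries map to
    point locations and class logits (illustrative, not OPUS's design)."""
    def __init__(self, n_queries=600, d=256, pts_per_query=8, n_classes=17):
        super().__init__()
        self.queries = nn.Embedding(n_queries, d)            # learnable query set
        self.point_head = nn.Linear(d, pts_per_query * 3)    # (x, y, z) per point
        self.cls_head = nn.Linear(d, pts_per_query * n_classes)
        self.pts, self.ncls = pts_per_query, n_classes

    def forward(self, scene_feats: torch.Tensor):
        # A full model would cross-attend queries to image features; here we
        # just add a pooled scene context for brevity.
        q = self.queries.weight + scene_feats.mean(dim=0)    # (Q, d)
        pts = self.point_head(q).view(-1, self.pts, 3)       # predicted locations
        logits = self.cls_head(q).view(-1, self.pts, self.ncls)
        return pts, logits

head = QueryOccupancyHead()
pts, logits = head(torch.randn(100, 256))  # 100 stand-in scene feature tokens
```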
arXiv Detail & Related papers (2024-09-14T07:44:22Z)
- AMP: Autoregressive Motion Prediction Revisited with Next Token Prediction for Autonomous Driving [59.94343412438211]
We introduce GPT-style next-token prediction into motion prediction.
Unlike language data, which is composed of homogeneous units (words), the elements in a driving scene can have complex spatial-temporal and semantic relations.
We propose to adopt three factorized attention modules with different neighbors for information aggregation and different position encoding styles to capture their relations.
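One way to read "factorized attention with different neighbors" is as separate attention modules over the temporal, agent-agent, and agent-map axes. The sketch below shows that factorization pattern; the neighbor choices, dimensions, and position-encoding handling are assumptions, not AMP's exact modules.

```python
import torch
import torch.nn as nn

class FactorizedSceneAttention(nn.Module):
    """Factorized attention sketch: attend along the time axis, the agent
    axis, and the map axis with separate modules (illustrative only)."""
    def __init__(self, d=128, heads=4):
        super().__init__()
        self.temporal = nn.MultiheadAttention(d, heads, batch_first=True)
        self.social = nn.MultiheadAttention(d, heads, batch_first=True)
        self.map = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, agent_hist, agent_now, map_tokens):
        # agent_hist: (A, T, d) per-agent history; agent_now: (A, 1, d);
        # map_tokens: (1, M, d) shared map elements.
        t, _ = self.temporal(agent_now, agent_hist, agent_hist)  # time axis
        a = t.transpose(0, 1)                                    # (1, A, d)
        s, _ = self.social(a, a, a)                              # agent axis
        m, _ = self.map(s, map_tokens, map_tokens)               # map context
        return m.transpose(0, 1)                                 # (A, 1, d)

attn = FactorizedSceneAttention()
out = attn(torch.randn(6, 10, 128), torch.randn(6, 1, 128), torch.randn(1, 12, 128))
```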
arXiv Detail & Related papers (2024-03-20T06:22:37Z)
- SEPT: Towards Efficient Scene Representation Learning for Motion Prediction [19.111948522155004]
This paper presents SEPT, a modeling framework that leverages self-supervised learning to develop powerful models for complex traffic scenes.
Experiments demonstrate that SEPT, without elaborate architectural design or feature engineering, achieves state-of-the-art performance on the Argoverse 1 and Argoverse 2 motion forecasting benchmarks.
arXiv Detail & Related papers (2023-09-26T21:56:03Z)
- Self-Supervised Representation Learning from Temporal Ordering of Automated Driving Sequences [49.91741677556553]
We propose TempO, a temporal ordering pretext task for pre-training region-level feature representations for perception tasks.
We represent each frame by an unordered set of proposal feature vectors, a representation that is natural for object detection or tracking systems.
Extensive evaluations on the BDD100K, nuImages, and MOT17 datasets show that our TempO pre-training approach outperforms single-frame self-supervised learning methods.
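A temporal-ordering pretext task can be sketched as follows: pool each frame's unordered proposal features with an order-invariant operation, then predict which shuffling of the clip was applied. This is a simplified permutation-classification variant under stated assumptions, not TempO's exact formulation.

```python
import math
import torch
import torch.nn as nn

class TemporalOrderingHead(nn.Module):
    """Pretext-task sketch: mean-pool each frame's proposal set (order-
    invariant), then classify the applied frame permutation."""
    def __init__(self, d=256, n_frames=3):
        super().__init__()
        self.n_perms = math.factorial(n_frames)          # possible orderings
        self.head = nn.Linear(d * n_frames, self.n_perms)

    def forward(self, proposal_sets):
        # proposal_sets: list of (N_i, d) tensors, one per (shuffled) frame.
        pooled = [p.mean(dim=0) for p in proposal_sets]  # order-invariant pool
        return self.head(torch.cat(pooled))              # logits over orderings

frames = [torch.randn(20, 256) for _ in range(3)]
logits = TemporalOrderingHead()(frames)  # train with cross-entropy on true perm
```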
arXiv Detail & Related papers (2023-02-17T18:18:27Z)