Pedestrian Intention Prediction via Vision-Language Foundation Models
- URL: http://arxiv.org/abs/2507.04141v1
- Date: Sat, 05 Jul 2025 19:39:00 GMT
- Title: Pedestrian Intention Prediction via Vision-Language Foundation Models
- Authors: Mohsen Azarmi, Mahdi Rezaei, He Wang
- Abstract summary: This study explores the potential of vision-language foundation models (VLFMs) for predicting pedestrian crossing intentions. The methodology incorporates contextual information, including visual frames, observations of physical cues, and ego-vehicle dynamics, into systematically refined prompts. Results demonstrate that incorporating vehicle speed, its variations over time, and time-conscious prompts significantly enhances prediction accuracy, by up to 19.8%.
- Score: 10.351342371371675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prediction of pedestrian crossing intention is a critical function in autonomous vehicles. Conventional vision-based methods of crossing intention prediction often struggle with generalizability, context understanding, and causal reasoning. This study explores the potential of vision-language foundation models (VLFMs) for predicting pedestrian crossing intentions by integrating multimodal data through hierarchical prompt templates. The methodology incorporates contextual information, including visual frames, observations of physical cues, and ego-vehicle dynamics, into systematically refined prompts to guide VLFMs effectively in intention prediction. Experiments were conducted on three common datasets: JAAD, PIE, and FU-PIP. Results demonstrate that incorporating vehicle speed, its variations over time, and time-conscious prompts significantly enhances prediction accuracy, by up to 19.8%. Additionally, optimised prompts generated via an automatic prompt engineering framework yielded a further 12.5% accuracy gain. These findings highlight the superior performance of VLFMs compared to conventional vision-based models, offering enhanced generalisation and contextual understanding for autonomous driving applications.
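The abstract does not reproduce the exact templates, but a minimal sketch of how such a hierarchical, time-conscious prompt could be assembled is shown below; the field names, wording, and speed-trend heuristic are illustrative assumptions, not the authors' refined prompts.
```python
# Illustrative only: template wording, field names, and the speed-trend
# heuristic are assumptions, not the paper's actual prompt templates.

def build_intention_prompt(physical_cues, ego_speeds_mps, frame_interval_s):
    """Compose a hierarchical, time-conscious prompt for a VLFM."""
    trend = ego_speeds_mps[-1] - ego_speeds_mps[0]
    trend_word = ("accelerating" if trend > 0
                  else "decelerating" if trend < 0 else "holding speed")
    return (
        f"Context: frames sampled every {frame_interval_s:.1f}s "
        "from an ego-vehicle camera.\n"
        f"Pedestrian cues: {'; '.join(physical_cues)}.\n"
        f"Ego-vehicle speed: {ego_speeds_mps[-1]:.1f} m/s "
        f"({trend_word}, {trend:+.1f} m/s over the window).\n"
        "Question: will the pedestrian cross the road in the next frames? "
        "Answer 'crossing' or 'not crossing' with a brief reason."
    )

prompt = build_intention_prompt(
    physical_cues=["standing at curb", "head turned toward ego vehicle"],
    ego_speeds_mps=[8.4, 7.1, 5.9],
    frame_interval_s=0.5,
)
print(prompt)  # sent to the VLFM together with the visual frames
```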
Related papers
- Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving [20.33096710167997]
A generative planning model with 3D-vision language pre-training, named GPVL, is proposed for end-to-end autonomous driving. A cross-modal language model is introduced to generate holistic driving decisions and fine-grained trajectories. The effective, robust, and efficient performance of GPVL is believed to be crucial for the practical application of future autonomous driving systems.
arXiv Detail & Related papers (2025-01-15T15:20:46Z)
- Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving [5.456780031044544]
We propose a knowledge distillation method that transfers knowledge from large-scale vision-language foundation models to efficient vision networks. We apply it to pedestrian behavior prediction and scene understanding tasks, achieving promising results in generating more diverse and comprehensive semantic attributes.
arXiv Detail & Related papers (2025-01-12T01:31:07Z)
- FollowGen: A Scaled Noise Conditional Diffusion Model for Car-Following Trajectory Prediction [9.2729178775419]
This study introduces a scaled noise conditional diffusion model for car-following trajectory prediction.
It integrates detailed inter-vehicular interactions and car-following dynamics into a generative framework, improving the accuracy and plausibility of predicted trajectories.
Experimental results on diverse real-world driving scenarios demonstrate the state-of-the-art performance and robustness of the proposed method.
arXiv Detail & Related papers (2024-11-23T23:13:45Z)
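As a rough illustration of the diffusion machinery FollowGen builds on, the sketch below implements a single DDPM-style reverse step for a follower trajectory conditioned on the leader's motion; the noise-predictor interface, schedule, and variance choice are assumptions, not the paper's architecture.
```python
# Sketch of one DDPM-style reverse step; the noise-predictor interface,
# schedule, and variance choice (sigma_t^2 = beta_t) are assumptions.
import torch

def denoise_step(model, x_t, t, leader_traj, betas):
    """Reverse step x_t -> x_{t-1} for a follower trajectory."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    eps_hat = model(x_t, t, leader_traj)          # noise predicted given the leader
    coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
    mean = (x_t - coef * eps_hat) / torch.sqrt(alphas[t])
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + torch.sqrt(betas[t]) * noise

betas = torch.linspace(1e-4, 0.02, 100)
noise_net = lambda x, t, cond: torch.zeros_like(x)  # stand-in noise predictor
x_t = torch.randn(1, 20, 2)                         # 20 future (x, y) waypoints
x_prev = denoise_step(noise_net, x_t, 50, None, betas)
```
- Enhancing End-to-End Autonomous Driving with Latent World Model [78.22157677787239]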
We propose a novel self-supervised learning approach using the LAtent World model (LAW) for end-to-end driving. LAW predicts future scene features based on current features and ego trajectories. This self-supervised task can be seamlessly integrated into perception-free and perception-based frameworks.
arXiv Detail & Related papers (2024-06-12T17:59:21Z)
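A minimal sketch of the self-supervised objective LAW describes, predicting next-frame scene features from current features plus the ego trajectory; the layer sizes and MSE loss are guesses, not the paper's implementation.
```python
# Minimal latent world-model sketch: predict next-frame scene features
# from current features and the ego trajectory. Sizes/loss are assumptions.
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    def __init__(self, feat_dim=256, ego_dim=16):
        super().__init__()
        self.transition = nn.Sequential(
            nn.Linear(feat_dim + ego_dim, 512), nn.ReLU(),
            nn.Linear(512, feat_dim),
        )

    def forward(self, scene_feat, ego_traj):
        return self.transition(torch.cat([scene_feat, ego_traj], dim=-1))

model = LatentWorldModel()
feat_t = torch.randn(4, 256)   # current scene features (batch of 4)
ego = torch.randn(4, 16)       # encoded ego trajectory
feat_t1 = torch.randn(4, 256)  # next frame's features (self-supervised target)
loss = nn.functional.mse_loss(model(feat_t, ego), feat_t1)
loss.backward()                # no manual labels needed
```
- RAG-based Explainable Prediction of Road Users Behaviors for Automated Driving using Knowledge Graphs and Large Language Models [8.253092044813595]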
We propose an explainable road users' behavior prediction system that integrates the reasoning abilities of Knowledge Graphs and Large Language Models.
Two use cases have been implemented following the proposed approach: 1) Prediction of pedestrians' crossing actions; 2) Prediction of lane change maneuvers.
arXiv Detail & Related papers (2024-05-01T11:06:31Z)
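A hand-wavy sketch of the retrieval-augmented pattern this entry describes: pull facts about a road user from a knowledge graph and turn them into an explainable LLM query. The graph schema and helper names are invented for illustration.
```python
# Illustrative RAG pattern: KG facts -> explainable LLM query.
# The graph content and schema are assumptions, not the paper's KG.
KNOWLEDGE_GRAPH = {
    "pedestrian_17": [
        ("isNear", "crosswalk_3"),
        ("gazeDirectedAt", "ego_vehicle"),
        ("movementState", "decelerating_walk"),
    ],
}

def retrieve_facts(entity):
    return [f"{entity} {p} {o}" for p, o in KNOWLEDGE_GRAPH.get(entity, [])]

def build_query(entity):
    facts = "\n".join(retrieve_facts(entity))
    return (
        f"Known facts:\n{facts}\n"
        f"Will {entity} cross the road? Answer yes/no and explain "
        "which facts support the prediction."
    )

print(build_query("pedestrian_17"))  # this prompt would be sent to the LLM
```
- Towards Generalizable and Interpretable Motion Prediction: A Deep Variational Bayes Approach [54.429396802848224]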
This paper proposes an interpretable generative model for motion prediction with robust generalizability to out-of-distribution cases.
For interpretability, the model achieves the target-driven motion prediction by estimating the spatial distribution of long-term destinations.
Experiments on motion prediction datasets show that the fitted model is both interpretable and generalizable.
arXiv Detail & Related papers (2024-03-10T04:16:04Z)
- The Integration of Prediction and Planning in Deep Learning Automated Driving Systems: A Review [43.30610493968783]
We review state-of-the-art deep learning-based planning systems, and focus on how they integrate prediction.
We discuss the implications, strengths, and limitations of different integration principles.
arXiv Detail & Related papers (2023-08-10T17:53:03Z)
- LOPR: Latent Occupancy PRediction using Generative Models [49.15687400958916]
LiDAR-generated occupancy grid maps (L-OGMs) offer a robust bird's-eye-view scene representation.
We propose a framework that decouples occupancy prediction into two stages: representation learning and prediction within the learned latent space.
arXiv Detail & Related papers (2022-10-03T22:04:00Z)
- AdvDO: Realistic Adversarial Attacks for Trajectory Prediction [87.96767885419423]
Trajectory prediction is essential for autonomous vehicles to plan correct and safe driving behaviors.
We devise an optimization-based adversarial attack framework to generate realistic adversarial trajectories.
Our attack can lead an AV to drive off the road or collide with other vehicles in simulation.
arXiv Detail & Related papers (2022-09-19T03:34:59Z)
- ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning [132.20119288212376]
We propose a spatial-temporal feature learning scheme that produces more representative features for perception, prediction, and planning tasks simultaneously.
To the best of our knowledge, we are the first to systematically investigate each part of an interpretable end-to-end vision-based autonomous driving system.
arXiv Detail & Related papers (2022-07-15T16:57:43Z)
- Control-Aware Prediction Objectives for Autonomous Driving [78.19515972466063]
We present control-aware prediction objectives (CAPOs) to evaluate the downstream effect of predictions on control without requiring the planner to be differentiable.
We propose two types of importance weights that weight the predictive likelihood: one using an attention model between agents, and another based on control variation when exchanging predicted trajectories for ground truth trajectories.
arXiv Detail & Related papers (2022-04-28T07:37:21Z)
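The second importance weight the CAPO entry describes (control variation when exchanging predicted trajectories for ground truth) can be sketched as follows, assuming an abstract planner callable; this illustrates the idea, not the paper's implementation.
```python
# Sketch of control-variation weighting: swap one agent's predicted
# trajectory for ground truth and measure how much the planned ego
# control changes. The planner interface is an assumption.
import torch

def control_variation_weights(planner, scene, pred_trajs, gt_trajs):
    base_plan = planner(scene, pred_trajs)
    diffs = []
    for i in range(len(pred_trajs)):
        swapped = list(pred_trajs)
        swapped[i] = gt_trajs[i]                  # exchange one agent's prediction
        diffs.append(torch.norm(planner(scene, swapped) - base_plan))
    w = torch.stack(diffs)
    return w / (w.sum() + 1e-8)                   # normalized importance weights

def weighted_nll(log_likelihoods, weights):
    # weights the predictive likelihood so control-relevant agents dominate
    return -(weights * log_likelihoods).sum()

planner = lambda scene, trajs: torch.stack(list(trajs)).mean(dim=0)  # toy stand-in
preds = [torch.randn(10, 2) for _ in range(3)]
gts = [torch.randn(10, 2) for _ in range(3)]
print(control_variation_weights(planner, None, preds, gts))
```
- Causal-based Time Series Domain Generalization for Vehicle Intention Prediction [19.944268567657307]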
Accurately predicting possible behaviors of traffic participants is an essential capability for autonomous vehicles.
In this paper, we aim to address the domain generalization problem for vehicle intention prediction tasks.
Our proposed method consistently improves prediction accuracy compared to other state-of-the-art domain generalization and behavior prediction methods.
arXiv Detail & Related papers (2021-12-03T18:58:07Z)
- Attentional-GCNN: Adaptive Pedestrian Trajectory Prediction towards Generic Autonomous Vehicle Use Cases [10.41902340952981]
We propose a novel Graph Convolutional Neural Network (GCNN)-based approach, Attentional-GCNN, which aggregates implicit interactions between pedestrians in a crowd by assigning attention weights to the edges of the graph.
We show that our method improves on the state of the art by 10% in Average Displacement Error (ADE) and 12% in Final Displacement Error (FDE), with fast inference speeds.
arXiv Detail & Related papers (2020-11-23T03:13:26Z)
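A toy sketch of attention weights on graph edges in the spirit of Attentional-GCNN, using a fully connected pedestrian graph; the scoring function and feature sizes are assumptions, not the paper's layers.
```python
# Toy edge-attention aggregation over a fully connected pedestrian graph.
# Scoring function and feature sizes are assumptions for illustration.
import torch
import torch.nn as nn

class EdgeAttentionLayer(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)   # scores each (i, j) edge
        self.update = nn.Linear(dim, dim)

    def forward(self, feats):
        """feats: (N, dim) node features, one per pedestrian."""
        n = feats.size(0)
        pairs = torch.cat(
            [feats.unsqueeze(1).expand(n, n, -1),
             feats.unsqueeze(0).expand(n, n, -1)], dim=-1)
        attn = torch.softmax(self.score(pairs).squeeze(-1), dim=-1)  # (N, N) edge weights
        return torch.relu(self.update(attn @ feats))                 # weighted aggregation

layer = EdgeAttentionLayer()
out = layer(torch.randn(5, 32))   # 5 pedestrians
print(out.shape)                  # torch.Size([5, 32])
```
- The Importance of Prior Knowledge in Precise Multimodal Prediction [71.74884391209955]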
Roads have well-defined geometries, topologies, and traffic rules.
In this paper we propose to incorporate structured priors as a loss function.
We demonstrate the effectiveness of our approach on real-world self-driving datasets.
arXiv Detail & Related papers (2020-06-04T03:56:11Z)
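A minimal sketch of a structured prior expressed as a loss term, here an off-road penalty against a drivable-area mask; the mask lookup and penalty form are assumptions, and a real training loss would need a differentiable variant (e.g. a distance-to-road field).
```python
# Off-road penalty as a structured prior. The grid lookup below is
# non-differentiable; it is an illustration of the loss idea only.
import torch

def offroad_prior_loss(pred_xy, drivable_mask, resolution_m=0.5):
    """pred_xy: (T, 2) predicted waypoints in meters, map frame.
    drivable_mask: (H, W) bool grid, True where driving is allowed."""
    idx = (pred_xy / resolution_m).long()
    h, w = drivable_mask.shape
    idx[:, 0].clamp_(0, w - 1)
    idx[:, 1].clamp_(0, h - 1)
    on_road = drivable_mask[idx[:, 1], idx[:, 0]].float()
    return (1.0 - on_road).mean()   # fraction of waypoints off the drivable area

mask = torch.zeros(100, 100, dtype=torch.bool)
mask[:, 40:60] = True               # a straight 10 m-wide lane corridor
traj = torch.tensor([[25.0, 5.0], [25.0, 10.0], [35.0, 15.0]])
print(offroad_prior_loss(traj, mask))  # nonzero: the last waypoint leaves the lane
```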
This list is automatically generated from the titles and abstracts of the papers on this site.