SceneAware: Scene-Constrained Pedestrian Trajectory Prediction with LLM-Guided Walkability
- URL: http://arxiv.org/abs/2506.14144v1
- Date: Tue, 17 Jun 2025 03:11:31 GMT
- Title: SceneAware: Scene-Constrained Pedestrian Trajectory Prediction with LLM-Guided Walkability
- Authors: Juho Bai, Inwook Shim
- Abstract summary: SceneAware is a novel framework that explicitly incorporates scene understanding to enhance trajectory prediction accuracy. We combine a Transformer-based trajectory encoder with a ViT-based scene encoder, capturing both temporal dynamics and spatial constraints. Our analysis shows that the model performs consistently well across various types of pedestrian movement.
- Score: 3.130722489512822
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate prediction of pedestrian trajectories is essential for applications in robotics and surveillance systems. While existing approaches primarily focus on social interactions between pedestrians, they often overlook the rich environmental context that significantly shapes human movement patterns. In this paper, we propose SceneAware, a novel framework that explicitly incorporates scene understanding to enhance trajectory prediction accuracy. Our method leverages a Vision Transformer (ViT) scene encoder to process environmental context from static scene images, while Multi-modal Large Language Models (MLLMs) generate binary walkability masks that distinguish between accessible and restricted areas during training. We combine a Transformer-based trajectory encoder with the ViT-based scene encoder, capturing both temporal dynamics and spatial constraints. The framework integrates collision penalty mechanisms that discourage predicted trajectories from violating physical boundaries, ensuring physically plausible predictions. SceneAware is implemented in both deterministic and stochastic variants. Comprehensive experiments on the ETH/UCY benchmark datasets show that our approach outperforms state-of-the-art methods, with more than 50% improvement over previous models. Our analysis based on different trajectory categories shows that the model performs consistently well across various types of pedestrian movement. This highlights the importance of using explicit scene information and shows that our scene-aware approach is both effective and reliable in generating accurate and physically plausible predictions. Code is available at: https://github.com/juho127/SceneAware.
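The abstract names two concrete mechanisms: binary walkability masks produced by MLLMs and a collision penalty that discourages predictions from entering restricted areas. The sketch below is a minimal, hypothetical illustration of how such a penalty could be computed during training; it is not the authors' released code, and the function names, the world-to-pixel mapping (`origin`, `metres_per_px`), and the loss weighting `lambda_col` are assumptions made only for the example.

```python
import torch
import torch.nn.functional as F

def collision_penalty(pred_traj, walk_mask, origin, metres_per_px):
    """Mean 'non-walkability' of predicted points, differentiable in pred_traj.

    pred_traj:     (B, T, 2) predicted world coordinates (x, y).
    walk_mask:     (H, W) mask with 1 = walkable, 0 = restricted
                   (e.g. obtained by prompting an MLLM on the scene image).
    origin:        (2,) tensor, world coordinate of the mask's top-left pixel.
    metres_per_px: scale between world units and mask pixels (assumed).
    """
    B, T, _ = pred_traj.shape
    H, W = walk_mask.shape
    # World -> pixel coordinates, then normalise to [-1, 1] for grid_sample.
    pix = (pred_traj - origin) / metres_per_px
    grid = torch.stack(
        [2.0 * pix[..., 0] / (W - 1) - 1.0,   # x maps to the width axis
         2.0 * pix[..., 1] / (H - 1) - 1.0],  # y maps to the height axis
        dim=-1,
    ).view(B, T, 1, 2)
    mask = walk_mask.float().view(1, 1, H, W).expand(B, -1, -1, -1)
    # Bilinear sampling keeps the penalty differentiable w.r.t. pred_traj.
    walkable = F.grid_sample(mask, grid, mode="bilinear",
                             padding_mode="border", align_corners=True)
    return (1.0 - walkable).mean()

# Hypothetical overall objective: displacement error plus a weighted
# collision term, echoing the abstract's "collision penalty mechanisms".
def training_loss(pred_traj, gt_traj, walk_mask, origin, metres_per_px,
                  lambda_col=1.0):
    ade = (pred_traj - gt_traj).norm(dim=-1).mean()
    return ade + lambda_col * collision_penalty(
        pred_traj, walk_mask, origin, metres_per_px)
```

Whether SceneAware implements the penalty this way (soft bilinear sampling of the mask) or with a different formulation is not stated in the abstract; the sketch only illustrates how a walkability mask can feed a physical-plausibility term in the loss.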
Related papers
- COME: Adding Scene-Centric Forecasting Control to Occupancy World Model [18.815436110557112]
World models are critical for autonomous driving to simulate environmental dynamics and generate synthetic data.
Existing methods struggle to disentangle ego-vehicle motion (perspective shifts) from scene evolvement.
We propose to separate environmental changes from ego-motion by leveraging scene-centric coordinate systems.
arXiv Detail & Related papers (2025-06-16T09:01:09Z)
- Steerable Scene Generation with Post Training and Inference-Time Search [24.93360616245269]
Training robots in simulation requires diverse 3D scenes that reflect specific challenges of downstream tasks.
We generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation.
We release a dataset of over 44 million SE(3) scenes spanning five diverse environments.
arXiv Detail & Related papers (2025-05-07T22:07:42Z)
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences.
Dita employs in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations.
Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
- Unified Human Localization and Trajectory Prediction with Monocular Vision [64.19384064365431]
MonoTransmotion is a Transformer-based framework that uses only a monocular camera to jointly solve localization and prediction tasks.
We show that by jointly training both tasks with our unified framework, our method is more robust in real-world scenarios with noisy inputs.
arXiv Detail & Related papers (2025-03-05T14:18:39Z)
- ASTRA: A Scene-aware TRAnsformer-based model for trajectory prediction [15.624698974735654]
ASTRA (A Scene-aware TRAnsformer-based model for trajectory prediction) is a lightweight pedestrian trajectory forecasting model.
We utilise a U-Net-based feature extractor, via its latent vector representation, to capture scene representations, and a graph-aware transformer encoder to capture social interactions.
arXiv Detail & Related papers (2025-01-16T23:28:30Z)
- Diffusion Transformer Policy [48.50988753948537]
We propose a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, to model continuous end-effector actions.
By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large diverse robot datasets.
arXiv Detail & Related papers (2024-10-21T12:43:54Z)
- MoST: Multi-modality Scene Tokenization for Motion Prediction [39.97334929667033]
We propose tokenizing the visual world into a compact set of scene elements.
We then leverage pre-trained image foundation models and LiDAR neural networks to encode all the scene elements in an open-vocabulary manner.
Our proposed representation can efficiently encode the multi-frame multi-modality observations with a few hundred tokens.
arXiv Detail & Related papers (2024-04-30T13:09:41Z)
- Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z)
- Social-Transmotion: Promptable Human Trajectory Prediction [65.80068316170613]
Social-Transmotion is a generic Transformer-based model that exploits diverse and numerous visual cues to predict human behavior.
Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.
arXiv Detail & Related papers (2023-12-26T18:56:49Z)
- COPILOT: Human-Environment Collision Prediction and Localization from Egocentric Videos [62.34712951567793]
The ability to forecast human-environment collisions from egocentric observations is vital to enable collision avoidance in applications such as VR, AR, and wearable assistive robotics.
We introduce the challenging problem of predicting collisions in diverse environments from multi-view egocentric videos captured from body-mounted cameras.
We propose a transformer-based model called COPILOT to perform collision prediction and localization simultaneously.
arXiv Detail & Related papers (2022-10-04T17:49:23Z)
- LOPR: Latent Occupancy PRediction using Generative Models [49.15687400958916]
LiDAR-generated occupancy grid maps (L-OGMs) offer a robust bird's-eye-view scene representation.
We propose a framework that decouples occupancy prediction into representation learning and prediction within the learned latent space.
arXiv Detail & Related papers (2022-10-03T22:04:00Z)
- Future Video Synthesis with Object Motion Prediction [54.31508711871764]
Instead of synthesizing images directly, our approach is designed to understand the complex scene dynamics.
The appearance of the scene components in the future is predicted by non-rigid deformation of the background and affine transformation of moving objects.
Experimental results on the Cityscapes and KITTI datasets show that our model outperforms the state-of-the-art in terms of visual quality and accuracy.
arXiv Detail & Related papers (2020-04-01T16:09:54Z)
- MCENET: Multi-Context Encoder Network for Homogeneous Agent Trajectory Prediction in Mixed Traffic [35.22312783822563]
Trajectory prediction in urban mixed-traffic zones is critical for many intelligent transportation systems.
We propose an approach named Multi-Context Encoder Network (MCENET) that is trained by encoding both past and future scene context.
At inference time, we combine the past context and motion information of the target agent with samplings of the latent variables to predict multiple realistic trajectories.
arXiv Detail & Related papers (2020-02-14T11:04:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.