EgoTraj-Bench: Towards Robust Trajectory Prediction Under Ego-view Noisy Observations
- URL: http://arxiv.org/abs/2510.00405v1
- Date: Wed, 01 Oct 2025 01:30:13 GMT
- Title: EgoTraj-Bench: Towards Robust Trajectory Prediction Under Ego-view Noisy Observations
- Authors: Jiayi Liu, Jiaming Zhou, Ke Ye, Kun-Yu Lin, Allan Wang, Junwei Liang
- Abstract summary: We introduce EgoTraj-Bench, the first real-world benchmark that grounds noisy, first-person visual histories in clean, bird's-eye-view future trajectories. We propose BiFlow, a dual-stream flow matching model that concurrently denoises historical observations and forecasts future motion. BiFlow achieves state-of-the-art performance, reducing minADE and minFDE by 10-15% on average and demonstrating superior robustness.
- Score: 28.981146701183448
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reliable trajectory prediction from an ego-centric perspective is crucial for robotic navigation in human-centric environments. However, existing methods typically assume idealized observation histories, failing to account for the perceptual artifacts inherent in first-person vision, such as occlusions, ID switches, and tracking drift. This discrepancy between training assumptions and deployment reality severely limits model robustness. To bridge this gap, we introduce EgoTraj-Bench, the first real-world benchmark that grounds noisy, first-person visual histories in clean, bird's-eye-view future trajectories, enabling robust learning under realistic perceptual constraints. Building on this benchmark, we propose BiFlow, a dual-stream flow matching model that concurrently denoises historical observations and forecasts future motion by leveraging a shared latent representation. To better model agent intent, BiFlow incorporates our EgoAnchor mechanism, which conditions the prediction decoder on distilled historical features via feature modulation. Extensive experiments show that BiFlow achieves state-of-the-art performance, reducing minADE and minFDE by 10-15% on average and demonstrating superior robustness. We anticipate that our benchmark and model will provide a critical foundation for developing trajectory forecasting systems truly resilient to the challenges of real-world, ego-centric perception.
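To make the dual-stream idea concrete, here is a minimal PyTorch sketch of a rectified-flow-style objective with two velocity heads: one transports the noisy observed history toward its clean version, the other transports Gaussian noise toward the future trajectory, and the future head is conditioned on the shared history latent via FiLM-style feature modulation in the spirit of EgoAnchor. Every name (BiFlowSketch, EgoAnchorFiLM, biflow_loss) and all dimensions are hypothetical; the abstract does not specify the actual architecture or training recipe.

```python
import torch
import torch.nn as nn

class EgoAnchorFiLM(nn.Module):
    """FiLM-style feature modulation: scale and shift decoder features
    using a condition vector distilled from the observed history.
    (Hypothetical stand-in for the paper's EgoAnchor mechanism.)"""
    def __init__(self, cond_dim, feat_dim):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, feat, cond):
        scale, shift = self.proj(cond).chunk(2, dim=-1)
        return feat * (1 + scale) + shift

class BiFlowSketch(nn.Module):
    """Two velocity heads over a shared history latent: one stream
    denoises the observed history, the other forecasts the future."""
    def __init__(self, d_obs, d_fut, dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_obs + 1, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))
        self.head_hist = nn.Linear(dim, d_obs)   # history-denoising velocity
        self.fut_in = nn.Linear(d_fut + 1, dim)
        self.anchor = EgoAnchorFiLM(dim, dim)
        self.head_fut = nn.Sequential(nn.ReLU(), nn.Linear(dim, d_fut))

    def forward(self, hist_t, fut_t, t):
        z = self.enc(torch.cat([hist_t, t], dim=-1))  # shared latent
        v_hist = self.head_hist(z)
        h = self.anchor(self.fut_in(torch.cat([fut_t, t], dim=-1)), z)
        return v_hist, self.head_fut(h)

def biflow_loss(model, noisy_hist, clean_hist, future):
    """Rectified-flow objective: regress the straight-line velocity
    x1 - x0 at a random time t, jointly for both streams."""
    t = torch.rand(future.size(0), 1)
    hist_t = (1 - t) * noisy_hist + t * clean_hist   # noisy obs -> clean history
    noise = torch.randn_like(future)
    fut_t = (1 - t) * noise + t * future             # Gaussian noise -> future traj
    v_hist, v_fut = model(hist_t, fut_t, t)
    return ((v_hist - (clean_hist - noisy_hist)) ** 2).mean() + \
           ((v_fut - (future - noise)) ** 2).mean()

# Toy usage: 8 observed and 12 future (x, y) waypoints, flattened.
model = BiFlowSketch(d_obs=16, d_fut=24)
loss = biflow_loss(model, torch.randn(4, 16), torch.randn(4, 16), torch.randn(4, 24))
loss.backward()
```

The shared latent z is what couples denoising and forecasting here; conditioning the future head on it is a FiLM-style analogue of the paper's described EgoAnchor conditioning, not its actual implementation.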
Related papers
- ARGaze: Autoregressive Transformers for Online Egocentric Gaze Estimation [46.30718574969354]
Egocentric gaze estimation predicts where a camera wearer is looking from first-person video using only past and current frames. We propose ARGaze, which reformulates gaze estimation as sequential prediction. We achieve state-of-the-art performance across multiple egocentric benchmarks under online evaluation.
arXiv Detail & Related papers (2026-02-04T23:33:16Z) - Causal World Modeling for Robot Control [56.31803788587547]
Video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. We introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations.
arXiv Detail & Related papers (2026-01-29T17:07:43Z) - Ego-centric Predictive Model Conditioned on Hand Trajectories [52.531681772560724]
In egocentric scenarios, anticipating both the next action and its visual outcome is essential for understanding human-object interactions. We propose a unified two-stage predictive framework that jointly models action and visual future in egocentric scenarios. Our approach is the first unified model designed to handle both egocentric human activity understanding and robotic manipulation tasks.
arXiv Detail & Related papers (2025-08-27T13:09:55Z) - Consistent World Models via Foresight Diffusion [56.45012929930605]
We argue that a key bottleneck in learning consistent diffusion-based world models lies in the suboptimal predictive ability. We propose Foresight Diffusion (ForeDiff), a diffusion-based world modeling framework that enhances consistency by decoupling condition understanding from target denoising.
arXiv Detail & Related papers (2025-05-22T10:01:59Z) - BEVWorld: A Multimodal World Simulator for Autonomous Driving via Scene-Level BEV Latents [56.33989853438012]
We propose BEVWorld, a framework that transforms multimodal sensor inputs into a unified and compact Bird's Eye View latent space for holistic environment modeling. The proposed world model consists of two main components: a multi-modal tokenizer and a latent BEV sequence diffusion model.
arXiv Detail & Related papers (2024-07-08T07:26:08Z) - AdaOcc: Adaptive Forward View Transformation and Flow Modeling for 3D Occupancy and Flow Prediction [56.72301849123049]
We present our solution for the Vision-Centric 3D Occupancy and Flow Prediction track in the nuScenes Open-Occ dataset challenge at CVPR 2024.
Our innovative approach involves a dual-stage framework that enhances 3D occupancy and flow predictions by incorporating adaptive forward view transformation and flow modeling.
Our method combines regression with classification to address scale variations in different scenes, and leverages predicted flow to warp current voxel features to future frames, guided by future frame ground truth.
arXiv Detail & Related papers (2024-07-01T16:32:15Z) - Vectorized Representation Dreamer (VRD): Dreaming-Assisted Multi-Agent Motion-Forecasting [2.2020053359163305]
We introduce VRD, a vectorized world model-inspired approach to the multi-agent motion forecasting problem.
Our method combines a traditional open-loop training regime with a novel dreamed closed-loop training pipeline.
Our model achieves state-of-the-art performance on the single prediction miss rate metric.
arXiv Detail & Related papers (2024-06-20T15:34:17Z) - Enhancing End-to-End Autonomous Driving with Latent World Model [78.22157677787239]
We propose a novel self-supervised learning approach using the LAtent World model (LAW) for end-to-end driving. LAW predicts future scene features based on current features and ego trajectories. This self-supervised task can be seamlessly integrated into perception-free and perception-based frameworks.
arXiv Detail & Related papers (2024-06-12T17:59:21Z) - Self-Supervised Bird's Eye View Motion Prediction with Cross-Modality Signals [38.20643428486824]
Learning dense bird's eye view (BEV) motion flow in a self-supervised manner is an emerging research direction for robotics and autonomous driving.
Current self-supervised methods mainly rely on point correspondences between point clouds.
We introduce a novel cross-modality self-supervised training framework that effectively addresses these issues by leveraging multi-modality data.
arXiv Detail & Related papers (2024-01-21T14:09:49Z) - Learning Robust Representations via Bidirectional Transition for Visual Reinforcement Learning [49.23256535551141]
We introduce a Bidirectional Transition (BiT) model, which predicts environmental transitions both forward and backward in time to extract reliable representations. Our model demonstrates competitive generalization performance and sample efficiency on two settings of the DeepMind Control suite.
arXiv Detail & Related papers (2023-12-04T14:19:36Z) - Smooth-Trajectron++: Augmenting the Trajectron++ behaviour prediction model with smooth attention [0.0]
This work investigates the state-of-the-art trajectory forecasting model Trajectron++, which we enhance by incorporating a smoothing term in its attention module. The smoothing term mimics human attention, drawing on cognitive science research that indicates limits to attention switching. We evaluate the resulting Smooth-Trajectron++ model and compare it to the original on various benchmarks; a sketch of one possible smoothing term follows this entry.
arXiv Detail & Related papers (2023-05-31T09:19:55Z)
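As a rough illustration of what a smoothing term on attention can look like, the sketch below adds a total-variation penalty that discourages abrupt frame-to-frame attention switching. This is an assumed, generic formulation for illustration only, not the specific term used in Smooth-Trajectron++; smooth_attention_penalty and the tensor shapes are hypothetical.

```python
import torch

def smooth_attention_penalty(attn):
    """Total-variation penalty on attention over time: discourage abrupt
    switching of attention between neighboring agents, loosely mirroring
    the cognitive-science limits cited above. attn: (B, T, N) with each
    (B, t) row summing to 1."""
    diff = attn[:, 1:, :] - attn[:, :-1, :]
    return diff.abs().sum(dim=-1).mean()

# Toy usage: batch of 4, 12 timesteps, attention over 5 agents.
attn = torch.softmax(torch.randn(4, 12, 5), dim=-1)
regularizer = 0.1 * smooth_attention_penalty(attn)  # added to the forecasting loss
```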