Context-based Motion Retrieval using Open Vocabulary Methods for Autonomous Driving
- URL: http://arxiv.org/abs/2508.00589v1
- Date: Fri, 01 Aug 2025 12:41:52 GMT
- Title: Context-based Motion Retrieval using Open Vocabulary Methods for Autonomous Driving
- Authors: Stefan Englmeier, Max A. Büttner, Katharina Winter, Fabian B. Flohr,
- Abstract summary: We propose a novel context-aware motion retrieval framework to support targeted evaluation of autonomous driving systems in diverse, human-centered scenarios.<n>Our approach outperforms state-of-the-art models by up to 27.5% accuracy in motion-context retrieval, when evaluated on the WayMoCo dataset.
- Score: 0.5249805590164902
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous driving systems must operate reliably in safety-critical scenarios, particularly those involving unusual or complex behavior by Vulnerable Road Users (VRUs). Identifying these edge cases in driving datasets is essential for robust evaluation and generalization, but retrieving such rare human behavior scenarios within the long tail of large-scale datasets is challenging. To support targeted evaluation of autonomous driving systems in diverse, human-centered scenarios, we propose a novel context-aware motion retrieval framework. Our method combines Skinned Multi-Person Linear (SMPL)-based motion sequences and corresponding video frames before encoding them into a shared multimodal embedding space aligned with natural language. Our approach enables the scalable retrieval of human behavior and their context through text queries. This work also introduces our dataset WayMoCo, an extension of the Waymo Open Dataset. It contains automatically labeled motion and scene context descriptions derived from generated pseudo-ground-truth SMPL sequences and corresponding image data. Our approach outperforms state-of-the-art models by up to 27.5% accuracy in motion-context retrieval, when evaluated on the WayMoCo dataset.
Related papers
- LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation [94.84458417662404]
LangTraj is a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios.<n>By conditioning on natural language inputs, LangTraj provides flexible and intuitive control over interactive behaviors.<n>LangTraj demonstrates strong performance in realism, language controllability, and language-conditioned safety-critical simulation.
arXiv Detail & Related papers (2025-04-15T17:14:06Z) - ROAD-Waymo: Action Awareness at Scale for Autonomous Driving [17.531603453254434]
ROAD-Waymo is an extensive dataset for the development and benchmarking of techniques for agent, action, location and event detection in road scenes.
Considerably larger and more challenging than any existing dataset (and encompassing multiple cities), it comes with 198k annotated video frames, 54k agent tubes, 3.9M bounding boxes and a total of 12.4M labels.
arXiv Detail & Related papers (2024-11-03T20:46:50Z) - Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.<n>Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.<n>We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z) - CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving [1.727597257312416]
CoVLA (Comprehensive Vision-Language-Action) dataset comprises real-world driving videos spanning more than 80 hours.<n>This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems.
arXiv Detail & Related papers (2024-08-19T09:53:49Z) - Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval [4.454835029368504]
We focus on the recently introduced text-motion retrieval which aim to search for sequences that are most relevant to a natural motion description.
Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models.
We propose to investigate joint-dataset learning - where we train on multiple text-motion datasets simultaneously.
We also introduce a transformer-based motion encoder, called MoT++, which employs the specified-temporal attention to process sequences of skeleton data.
arXiv Detail & Related papers (2024-07-02T09:43:47Z) - Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL)
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z) - Probing Multimodal LLMs as World Models for Driving [72.18727651074563]
We look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving.
Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored.
arXiv Detail & Related papers (2024-05-09T17:52:42Z) - BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving [46.84729450920804]
We propose the BEV-TSR framework which leverages descriptive text as an input to retrieve corresponding scenes in the Bird's Eye View space.
We employ a large language model (LLM) to extract the semantic features of the text inputs and incorporate knowledge graph embeddings to enhance the semantic richness of the language embedding.
Experimental results on the multi-level nuScenes-Retrieval show that BEV-TSR achieves state-of-the-art performance, e.g., 85.78% and 87.66% top-1 accuracy on scene-to-text and text-to-scene
arXiv Detail & Related papers (2024-01-02T06:56:23Z) - Trajeglish: Traffic Modeling as Next-Token Prediction [67.28197954427638]
A longstanding challenge for self-driving development is simulating dynamic driving scenarios seeded from recorded driving logs.
We apply tools from discrete sequence modeling to model how vehicles, pedestrians and cyclists interact in driving scenarios.
Our model tops the Sim Agents Benchmark, surpassing prior work along the realism meta metric by 3.3% and along the interaction metric by 9.9%.
arXiv Detail & Related papers (2023-12-07T18:53:27Z) - Context-Aware Timewise VAEs for Real-Time Vehicle Trajectory Prediction [4.640835690336652]
We present ContextVAE, a context-aware approach for multi-modal vehicle trajectory prediction.
Our approach takes into account both the social features exhibited by agents on the scene and the physical environment constraints.
In all tested datasets, ContextVAE models are fast to train and provide high-quality multi-modal predictions in real-time.
arXiv Detail & Related papers (2023-02-21T18:42:24Z) - Learning Interactive Driving Policies via Data-driven Simulation [125.97811179463542]
Data-driven simulators promise high data-efficiency for driving policy learning.
Small underlying datasets often lack interesting and challenging edge cases for learning interactive driving.
We propose a simulation method that uses in-painted ado vehicles for learning robust driving policies.
arXiv Detail & Related papers (2021-11-23T20:14:02Z) - SoDA: Multi-Object Tracking with Soft Data Association [75.39833486073597]
Multi-object tracking (MOT) is a prerequisite for a safe deployment of self-driving cars.
We propose a novel approach to MOT that uses attention to compute track embeddings that encode dependencies between observed objects.
arXiv Detail & Related papers (2020-08-18T03:40:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.