Hard Cases Detection in Motion Prediction by Vision-Language Foundation Models
- URL: http://arxiv.org/abs/2405.20991v1
- Date: Fri, 31 May 2024 16:35:41 GMT
- Title: Hard Cases Detection in Motion Prediction by Vision-Language Foundation Models
- Authors: Yi Yang, Qingwen Zhang, Kei Ikemura, Nazre Batool, John Folkesson
- Abstract summary: This work explores the potential of Vision-Language Foundation Models (VLMs) in detecting hard cases in autonomous driving.
We introduce a feasible pipeline where VLMs, fed with sequential image frames and designed prompts, effectively identify challenging agents or scenarios.
We show the effectiveness and feasibility of incorporating our pipeline with state-of-the-art prediction methods on the nuScenes dataset.
- Score: 16.452638202694246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Addressing hard cases in autonomous driving, such as anomalous road users, extreme weather conditions, and complex traffic interactions, presents significant challenges. To ensure safety, autonomous driving systems must detect and manage these scenarios effectively. However, the rarity and high-risk nature of these cases demand extensive, diverse datasets for training robust models. Vision-Language Foundation Models (VLMs), trained on extensive datasets, have shown remarkable zero-shot capabilities. This work explores the potential of VLMs in detecting hard cases in autonomous driving. We demonstrate the capability of VLMs such as GPT-4V to detect hard cases in traffic participant motion prediction at both the agent and scenario levels. We introduce a feasible pipeline in which VLMs, fed with sequential image frames and designed prompts, effectively identify challenging agents or scenarios, which are verified by existing prediction models. Moreover, by taking advantage of this VLM-based detection of hard cases, we further improve the training efficiency of the existing motion prediction pipeline by performing data selection on the training samples suggested by GPT. We show the effectiveness and feasibility of our pipeline incorporating VLMs with state-of-the-art methods on the nuScenes dataset. The code is accessible at https://github.com/KTH-RPL/Detect_VLM.
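The authors' implementation is in the linked repository; as an illustration only, the minimal sketch below shows how such a VLM query for hard-case detection and subsequent data selection might look. It assumes the OpenAI Python SDK with a vision-capable chat model; the prompt wording, model name, frame paths, and difficulty threshold are placeholder assumptions, not the paper's actual settings.

```python
# Minimal, illustrative sketch (not the authors' code): send a few sequential
# camera frames to a VLM, ask it to rate how hard the scenario is for motion
# prediction, and keep only hard scenarios for focused training.
import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def encode_frame(path: Path) -> dict:
    """Pack one image frame as a base64 data URL for the chat API."""
    b64 = base64.b64encode(path.read_bytes()).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

def rate_scenario(frame_paths: list[Path]) -> int:
    """Ask the VLM for a 1-5 difficulty score for the scenario (placeholder prompt)."""
    prompt = ("You see sequential front-camera frames from a driving scene. "
              "Rate from 1 (easy) to 5 (hard) how difficult it is to predict "
              "the motion of the surrounding traffic participants. "
              "Answer with a single digit.")
    content = [{"type": "text", "text": prompt}] + [encode_frame(p) for p in frame_paths]
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the paper uses GPT-4V
        messages=[{"role": "user", "content": content}],
    )
    return int(resp.choices[0].message.content.strip()[0])

# Data selection: keep only scenarios the VLM flags as hard (score >= 4 is an assumed cut-off).
scenarios = {"scene-0001": [Path("scene-0001/f0.jpg"), Path("scene-0001/f1.jpg")]}
hard_cases = [sid for sid, frames in scenarios.items() if rate_scenario(frames) >= 4]
print(hard_cases)
```

In the paper, the same idea is applied at both the agent and the scenario level and the VLM output is cross-checked against existing prediction models; the sketch above covers only scenario-level scoring and the resulting training-sample selection.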
Related papers
- CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving [1.727597257312416]
The CoVLA (Comprehensive Vision-Language-Action) dataset comprises real-world driving videos spanning more than 80 hours.
This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems.
arXiv Detail & Related papers (2024-08-19T09:53:49Z)
- Using Multimodal Large Language Models for Automated Detection of Traffic Safety Critical Events [5.233512464561313]
Multimodal Large Language Models (MLLMs) offer a novel approach by integrating textual, visual, and audio modalities.
Our framework leverages the reasoning power of MLLMs, directing their output through context-specific prompts.
Preliminary results demonstrate the framework's potential in zero-shot learning and accurate scenario analysis.
arXiv Detail & Related papers (2024-06-19T23:50:41Z)
- AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving [68.73885845181242]
We propose an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios.
We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
arXiv Detail & Related papers (2024-03-26T04:27:56Z)
- A Diffusion-Model of Joint Interactive Navigation [14.689298253430568]
We present DJINN, a diffusion-based method for generating traffic scenarios.
Our approach jointly diffuses the trajectories of all agents, conditioned on a flexible set of state observations from the past, present, or future.
We show how DJINN flexibly enables direct test-time sampling from a variety of valuable conditional distributions.
arXiv Detail & Related papers (2023-09-21T22:10:20Z)
- Unsupervised Domain Adaptation for Self-Driving from Past Traversal Features [69.47588461101925]
We propose a method to adapt 3D object detectors to new driving environments.
Our approach enhances LiDAR-based detection models using spatial quantized historical features.
Experiments on real-world datasets demonstrate significant improvements.
arXiv Detail & Related papers (2023-09-21T15:00:31Z)
- Unsupervised Self-Driving Attention Prediction via Uncertainty Mining and Knowledge Embedding [51.8579160500354]
We propose an unsupervised way to predict self-driving attention by uncertainty modeling and driving knowledge integration.
Results show equivalent or even more impressive performance compared to fully-supervised state-of-the-art approaches.
arXiv Detail & Related papers (2023-03-17T00:28:33Z)
- Detection of Active Emergency Vehicles using Per-Frame CNNs and Output Smoothing [4.917229375785646]
Inferring common actor states (such as position or velocity) is an important and well-explored task of the perception system aboard a self-driving vehicle.
This is especially true in the case of active emergency vehicles (EVs), where light-based signals also need to be captured to provide a full context.
We propose a sequential methodology for the detection of active EVs, using an off-the-shelf CNN model operating at a frame level and a downstream smoother that accounts for the temporal aspect of flashing EV lights.
arXiv Detail & Related papers (2022-12-28T04:45:51Z)
- Learning energy-efficient driving behaviors by imitating experts [75.12960180185105]
This paper examines the role of imitation learning in bridging the gap between control strategies and realistic limitations in communication and sensing.
We show that imitation learning can succeed in deriving policies that, if adopted by 5% of vehicles, may boost the energy-efficiency of networks with varying traffic conditions by 15% using only local observations.
arXiv Detail & Related papers (2022-06-28T17:08:31Z)
- VISTA 2.0: An Open, Data-driven Simulator for Multimodal Sensing and Policy Learning for Autonomous Vehicles [131.2240621036954]
We present VISTA, an open source, data-driven simulator that integrates multiple types of sensors for autonomous vehicles.
Using high fidelity, real-world datasets, VISTA represents and simulates RGB cameras, 3D LiDAR, and event-based cameras.
We demonstrate the ability to train and test perception-to-control policies across each of the sensor types and showcase the power of this approach via deployment on a full scale autonomous vehicle.
arXiv Detail & Related papers (2021-11-23T18:58:10Z)
- WiP Abstract : Robust Out-of-distribution Motion Detection and Localization in Autonomous CPS [3.464656011246703]
We propose a robust out-of-distribution (OOD) detection framework for deep learning.
Our approach detects unusual movements from driving video in real-time by combining classical optic flow operation with representation learning.
Evaluation on a driving simulation data set shows that our approach is statistically more robust than related works.
arXiv Detail & Related papers (2021-07-25T06:20:05Z)
- Generating and Characterizing Scenarios for Safety Testing of Autonomous Vehicles [86.9067793493874]
We propose efficient mechanisms to characterize and generate testing scenarios using a state-of-the-art driving simulator.
We use our method to characterize real driving data from the Next Generation Simulation (NGSIM) project.
We rank the scenarios by defining metrics based on the complexity of avoiding accidents and provide insights into how the AV could have minimized the probability of incurring an accident.
arXiv Detail & Related papers (2021-03-12T17:00:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.