Are AI-Generated Driving Videos Ready for Autonomous Driving? A Diagnostic Evaluation Framework
- URL: http://arxiv.org/abs/2512.06376v1
- Date: Sat, 06 Dec 2025 10:06:27 GMT
- Title: Are AI-Generated Driving Videos Ready for Autonomous Driving? A Diagnostic Evaluation Framework
- Authors: Xinhao Xiang, Abhijeet Rastogi, Jiawei Zhang
- Abstract summary: Recent text-to-video models have enabled the generation of high-resolution driving scenes from natural language prompts. These AI-generated driving videos (AIGVs) offer a low-cost, scalable alternative to real or simulator data for autonomous driving (AD). But a key question remains: can such videos reliably support training and evaluation of AD models? We present a diagnostic framework that systematically studies this question.
- Score: 5.557926430369991
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent text-to-video models have enabled the generation of high-resolution driving scenes from natural language prompts. These AI-generated driving videos (AIGVs) offer a low-cost, scalable alternative to real or simulator data for autonomous driving (AD). But a key question remains: can such videos reliably support training and evaluation of AD models? We present a diagnostic framework that systematically studies this question. First, we introduce a taxonomy of frequent AIGV failure modes, including visual artifacts, physically implausible motion, and violations of traffic semantics, and demonstrate their negative impact on object detection, tracking, and instance segmentation. To support this analysis, we build ADGV-Bench, a driving-focused benchmark with human quality annotations and dense labels for multiple perception tasks. We then propose ADGVE, a driving-aware evaluator that combines static semantics, temporal cues, lane obedience signals, and Vision-Language Model (VLM)-guided reasoning into a single quality score for each clip. Experiments show that blindly adding raw AIGVs can degrade perception performance, while filtering them with ADGVE consistently improves both general video quality assessment metrics and downstream AD models, and turns AIGVs into a beneficial complement to real-world data. Our study highlights both the risks and the promise of AIGVs, and provides practical tools for safely leveraging large-scale video generation in future AD pipelines.
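The abstract's core recipe is to combine several per-clip diagnostic signals into one quality score and keep only clips that clear a threshold before training. A minimal sketch of that filtering idea follows; the signal names, weights, and threshold are illustrative assumptions, not the paper's actual ADGVE implementation.

```python
# Hypothetical sketch of an ADGVE-style quality filter: combine per-clip
# diagnostic signals (each assumed normalized to [0, 1]) into one score,
# then keep only clips above a threshold. Names and values are illustrative.

def quality_score(signals, weights=None):
    """Weighted average of per-clip quality signals in [0, 1]."""
    if weights is None:
        weights = {k: 1.0 for k in signals}  # equal weighting by default
    total = sum(weights[k] for k in signals)
    return sum(weights[k] * signals[k] for k in signals) / total

def filter_clips(clips, threshold=0.6):
    """Keep clips whose combined quality score clears the threshold."""
    return [c for c in clips if quality_score(c["signals"]) >= threshold]

clips = [
    {"id": "aigv_001", "signals": {"static_semantics": 0.9, "temporal": 0.8,
                                   "lane_obedience": 0.7, "vlm_reasoning": 0.9}},
    {"id": "aigv_002", "signals": {"static_semantics": 0.4, "temporal": 0.2,
                                   "lane_obedience": 0.3, "vlm_reasoning": 0.5}},
]
kept = filter_clips(clips)
print([c["id"] for c in kept])  # → ['aigv_001']
```

Only the first clip (score 0.825) survives the 0.6 threshold; the second (0.35) is discarded, mirroring the paper's finding that raw, unfiltered AIGVs can hurt downstream models.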
Related papers
- Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos [20.73513310337503]
Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving. We propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos.
arXiv Detail & Related papers (2026-02-25T16:38:53Z)
- DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving [49.11389494068169]
We present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen combines a diverse evaluation dataset curated from both driving datasets and internet-scale video sources. General models look better but break physics, while driving-specific ones capture motion realistically but lag in visual quality.
arXiv Detail & Related papers (2026-01-04T13:36:21Z)
- Towards Safer and Understandable Driver Intention Prediction [30.136400523083907]
We introduce the task of interpretability in maneuver prediction before maneuvers occur, for driver safety. To foster research in interpretable DIP, we curate DAAD-X, a new multimodal, ego-centric video dataset. Next, we propose the Video Concept Bottleneck Model (VCBM), a framework that inherently generates coherent explanations.
arXiv Detail & Related papers (2025-10-10T09:41:25Z)
- AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving [31.127210974372456]
Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning. We introduce AgentThink, a pioneering unified framework that integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks.
arXiv Detail & Related papers (2025-05-21T09:27:43Z)
- VACT: A Video Automatic Causal Testing System and a Benchmark [55.53300306960048]
VACT is an automated framework for modeling, evaluating, and measuring the causal understanding of VGMs in real-world scenarios. We introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs.
arXiv Detail & Related papers (2025-03-08T10:54:42Z)
- AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM [2.5988879420706095]
Video anomaly detection (VAD) is crucial for video analysis and surveillance in computer vision. Existing VAD models rely on learned normal patterns, which makes them difficult to apply to diverse environments. This study proposes a customizable video anomaly detection (C-VAD) technique and the AnyAnomaly model.
arXiv Detail & Related papers (2025-03-06T14:52:34Z)
- VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision [17.36342349850825]
Vision-language models (VLMs) are used as teachers to enhance training by providing additional supervision. VLM-AD achieves significant improvements in planning accuracy and reduced collision rates on the nuScenes dataset.
arXiv Detail & Related papers (2024-12-19T01:53:36Z)
- AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM [54.44479359918971]
We first present AIGVQA-DB, a large-scale dataset comprising 36,576 AIGVs generated by 15 advanced text-to-video models using 1,048 prompts.
We then introduce AIGV-Assessor, a novel VQA model that leverages intricate quality attributes to capture precise video quality scores and pair video preferences.
arXiv Detail & Related papers (2024-11-26T08:43:15Z)
- DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving [12.004604110512421]
Vision language models (VLMs) are emerging as revolutionary tools with significant potential to influence autonomous driving.
We propose the DriveGenVLM framework to generate driving videos and use VLMs to understand them.
arXiv Detail & Related papers (2024-08-29T15:52:56Z)
- GAIA: Rethinking Action Quality Assessment for AI-Generated Videos [56.047773400426486]
Action quality assessment (AQA) algorithms predominantly focus on actions from real specific scenarios and are pre-trained with normative action features.
We construct GAIA, a Generic AI-generated Action dataset, by conducting a large-scale subjective evaluation from a novel causal reasoning-based perspective.
Results show that traditional AQA methods, action-related metrics in recent T2V benchmarks, and mainstream video quality methods perform poorly with an average SRCC of 0.454, 0.191, and 0.519, respectively.
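The SRCC figures above are Spearman rank correlation coefficients between a metric's predicted quality scores and human ratings. A minimal pure-Python computation (toy scores, not GAIA data; the simplified formula below assumes no tied ranks):

```python
# Spearman rank correlation (SRCC) between two score lists.
# Uses the classic formula 1 - 6*sum(d^2)/(n*(n^2-1)), valid when
# there are no tied values, as in this toy example.

def ranks(xs):
    """Rank of each element (1 = smallest)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def srcc(a, b):
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Illustrative human mean opinion scores vs. a metric's predictions.
human = [3.1, 4.5, 2.0, 4.9, 3.7]
metric = [0.42, 0.61, 0.30, 0.58, 0.51]
print(f"SRCC = {srcc(human, metric):.3f}")  # → SRCC = 0.900
```

Values near 1.0 mean the metric ranks videos the way humans do; the ~0.2-0.5 SRCCs reported above indicate existing metrics rank AI-generated actions poorly.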
arXiv Detail & Related papers (2024-06-10T08:18:07Z)
- Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases [102.05741859030951]
We propose CODA-LM, the first benchmark for the automatic evaluation of LVLMs for self-driving corner cases. We show that using text-only large language models as judges reveals even better alignment with human preferences than the LVLM judges. Our CODA-VLM performs comparably with GPT-4V, even surpassing GPT-4V by +21.42% on the regional perception task.
arXiv Detail & Related papers (2024-04-16T14:20:55Z)
- AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving [68.73885845181242]
We propose an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios.
We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
arXiv Detail & Related papers (2024-03-26T04:27:56Z)
- Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.