Related papers: AgentDrive: An Open Benchmark Dataset for Agentic AI Reasoning with LLM-Generated Scenarios in Autonomous Systems

AgentDrive: An Open Benchmark Dataset for Agentic AI Reasoning with LLM-Generated Scenarios in Autonomous Systems

URL: http://arxiv.org/abs/2601.16964v1
Date: Fri, 23 Jan 2026 18:33:41 GMT
Title: AgentDrive: An Open Benchmark Dataset for Agentic AI Reasoning with LLM-Generated Scenarios in Autonomous Systems
Authors: Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah,
Abstract summary: This paper introduces AgentDrive, an open benchmark dataset containing 300,000 driving scenarios.<n>AgentDrive formalizes a factorized scenario space across seven axes: scenario type, driver behavior, environment, road layout, objective, difficulty, and traffic density.<n>To complement simulation-based evaluation, we introduce AgentDrive-MCQ, a 100,000-question multiple-choice benchmark spanning five reasoning dimensions.
Score: 3.099103925863002
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid advancement of large language models (LLMs) has sparked growing interest in their integration into autonomous systems for reasoning-driven perception, planning, and decision-making. However, evaluating and training such agentic AI models remains challenging due to the lack of large-scale, structured, and safety-critical benchmarks. This paper introduces AgentDrive, an open benchmark dataset containing 300,000 LLM-generated driving scenarios designed for training, fine-tuning, and evaluating autonomous agents under diverse conditions. AgentDrive formalizes a factorized scenario space across seven orthogonal axes: scenario type, driver behavior, environment, road layout, objective, difficulty, and traffic density. An LLM-driven prompt-to-JSON pipeline generates semantically rich, simulation-ready specifications that are validated against physical and schema constraints. Each scenario undergoes simulation rollouts, surrogate safety metric computation, and rule-based outcome labeling. To complement simulation-based evaluation, we introduce AgentDrive-MCQ, a 100,000-question multiple-choice benchmark spanning five reasoning dimensions: physics, policy, hybrid, scenario, and comparative reasoning. We conduct a large-scale evaluation of fifty leading LLMs on AgentDrive-MCQ. Results show that while proprietary frontier models perform best in contextual and policy reasoning, advanced open models are rapidly closing the gap in structured and physics-grounded reasoning. We release the AgentDrive dataset, AgentDrive-MCQ benchmark, evaluation code, and related materials at https://github.com/maferrag/AgentDrive

Related papers

InfGen: Scenario Generation as Next Token Group Prediction [49.54222089551598]
InfGen is a scenario generation framework that outputs agent states and trajectories in an autoregressive manner.<n>Experiments demonstrate that InfGen produces realistic, diverse, and adaptive traffic behaviors.
arXiv Detail & Related papers (2025-06-29T16:18:32Z)
Integrating Counterfactual Simulations with Language Models for Explaining Multi-Agent Behaviour [35.19786322586909]
We propose Agentic eXplanations via Interrogative Simulation (AXIS)<n>AXIS generates human-centred action explanations for multi-agent policies.<n>We evaluate AXIS on autonomous driving across ten scenarios for five LLMs.
arXiv Detail & Related papers (2025-05-23T12:19:18Z)
AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios [51.46347732659174]
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications.<n>AgentIF is the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios.
arXiv Detail & Related papers (2025-05-22T17:31:10Z)
AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving [31.127210974372456]
Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning.<n>We introduce textbfAgentThink, a pioneering unified framework that integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks.
arXiv Detail & Related papers (2025-05-21T09:27:43Z)
On Simulation-Guided LLM-based Code Generation for Safe Autonomous Driving Software [0.577182115743694]
Automated Driving System (ADS) is a safety-critical software system responsible for the interpretation of the vehicle's environment.<n>Development of ADS requires rigorous processes to verify, validate, assess, and qualify the code before it can be deployed in the vehicle.<n>This study developed and evaluated a prototype for automatic code generation and assessment.
arXiv Detail & Related papers (2025-04-02T21:35:11Z)
AutoDrive-QA: A Multiple-Choice Benchmark for Vision-Language Evaluation in Urban Autonomous Driving [0.7734726150561086]
We introduce AutoDrive-QA, the first benchmark that systematically converts open-ended driving QA into structured multiple-choice questions.<n>We show that fine-tuning LLaVA-1.5-7B improves accuracy by about six percentage points across tasks, GPT-4V achieves the strongest zero-shot performance with up to 69.8% accuracy, and Qwen2-VL models also perform competitively.
arXiv Detail & Related papers (2025-03-20T01:32:00Z)
DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding [76.3876070043663]
We propose DriveLMM-o1, a dataset and benchmark designed to advance step-wise visual reasoning for autonomous driving.<n>Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning.<n>Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model.
arXiv Detail & Related papers (2025-03-13T17:59:01Z)
DriveLM: Driving with Graph Visual Question Answering [57.51930417790141]
We study how vision-language models (VLMs) trained on web-scale data can be integrated into end-to-end driving systems.<n>We propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving.
arXiv Detail & Related papers (2023-12-21T18:59:12Z)
DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving [69.82743399946371]
DriveMLM is a framework that can perform close-loop autonomous driving in realistic simulators. We employ a multi-modal LLM (MLLM) to model the behavior planning module of a module AD system. This model can plug-and-play in existing AD systems such as Apollo for close-loop driving.
arXiv Detail & Related papers (2023-12-14T18:59:05Z)
Trajeglish: Traffic Modeling as Next-Token Prediction [67.28197954427638]
A longstanding challenge for self-driving development is simulating dynamic driving scenarios seeded from recorded driving logs. We apply tools from discrete sequence modeling to model how vehicles, pedestrians and cyclists interact in driving scenarios. Our model tops the Sim Agents Benchmark, surpassing prior work along the realism meta metric by 3.3% and along the interaction metric by 9.9%.
arXiv Detail & Related papers (2023-12-07T18:53:27Z)
Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving [6.728693243652425]
Large Language Models (LLMs) have shown promise in the autonomous driving sector, particularly in generalization and interpretability. We introduce a unique object-level multimodal LLM architecture that merges vectorized numeric modalities with a pre-trained LLM to improve context understanding in driving situations.
arXiv Detail & Related papers (2023-10-03T11:05:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.