Evaluation of Large Language Models for Anomaly Detection in Autonomous Vehicles
- URL: http://arxiv.org/abs/2509.05315v1
- Date: Fri, 29 Aug 2025 13:05:13 GMT
- Title: Evaluation of Large Language Models for Anomaly Detection in Autonomous Vehicles
- Authors: Petros Loukas, David Bassir, Savvas Chatzichristofis, Angelos Amanatiadis
- Abstract summary: This work evaluates large language models (LLMs) on real-world edge cases where current autonomous vehicles have been proven to fail. The proposed architecture consists of an open-vocabulary object detector coupled with prompt engineering and large language model contextual reasoning.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid evolution of large language models (LLMs) has pushed their boundaries toward applications in many domains. Recently, the research community has started to evaluate their potential adoption in autonomous vehicles, especially as complementary modules in the perception and planning software stacks. However, their evaluation has so far been limited to synthetic or manual driving datasets that lack ground-truth knowledge of how current perception and planning algorithms would perform in the cases under evaluation. For this reason, this work evaluates LLMs on real-world edge cases where current autonomous vehicles have been proven to fail. The proposed architecture consists of an open-vocabulary object detector coupled with prompt engineering and large language model contextual reasoning. We evaluate several state-of-the-art models against real edge cases and provide qualitative comparison results, along with a discussion of the findings regarding the potential application of LLMs as anomaly detectors in autonomous vehicles.
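The coupling described in the abstract (detector output serialized into a prompt, then passed to an LLM for contextual anomaly reasoning) can be illustrated with a minimal sketch. This is not the paper's actual implementation: the `Detection` structure, prompt wording, and scene description are hypothetical, and the LLM call itself is omitted since the paper does not specify a particular API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Detection:
    """One open-vocabulary detection: class label, confidence, pixel box."""
    label: str
    confidence: float
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2)

def build_anomaly_prompt(detections: List[Detection], scene: str = "urban road") -> str:
    """Serialize detector output into a prompt asking an LLM to flag anomalies."""
    # List objects from most to least confident so the LLM sees salient ones first.
    lines = [
        f"- {d.label} (confidence {d.confidence:.2f}, box {d.box})"
        for d in sorted(detections, key=lambda d: d.confidence, reverse=True)
    ]
    return (
        f"You are assisting an autonomous vehicle driving in a {scene} scene.\n"
        "An open-vocabulary detector reported the following objects:\n"
        + "\n".join(lines)
        + "\nIs any object anomalous or hazardous for driving? "
        "Answer YES or NO, then explain briefly."
    )

# Hypothetical edge case: a couch on the highway alongside a normal car.
dets = [
    Detection("couch", 0.87, (320, 410, 560, 540)),
    Detection("car", 0.95, (40, 380, 220, 500)),
]
prompt = build_anomaly_prompt(dets, scene="highway")
print(prompt)
```

The prompt string would then be sent to whichever LLM is under evaluation; the paper's contribution is comparing how different state-of-the-art models respond to such edge-case inputs.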
Related papers
- Claim Automation using Large Language Model [0.0]
Large Language Models (LLMs) have achieved strong performance on general-purpose language tasks, but their deployment in regulated and data-sensitive domains, including insurance, remains limited. We propose a locally deployed, governance-aware language modeling component that generates structured corrective-action recommendations from unstructured claim narratives. We fine-tune pretrained LLMs using Low-Rank Adaptation (LoRA), scoping the model to an initial decision module within the claim processing pipeline to speed up claim adjusters' decisions.
arXiv Detail & Related papers (2026-02-18T20:01:12Z) - LTD-Bench: Evaluating Large Language Models by Letting Them Draw [57.237152905238084]
LTD-Bench is a breakthrough benchmark for large language models (LLMs). It transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigating model similarity.
arXiv Detail & Related papers (2025-11-04T08:11:23Z) - Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis [19.212494396144404]
Simulation- and scenario-based testing have emerged as key approaches to the development and validation of autonomous driving systems. Foundation models represent a new generation of pre-trained, general-purpose AI models. Our survey presents a unified taxonomy that includes large language models, vision-language models, multimodal large language models, diffusion models, and world models for the generation and analysis of autonomous driving scenarios.
arXiv Detail & Related papers (2025-06-13T07:25:59Z) - STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving [16.602141801221364]
STSBench is a framework to benchmark the holistic understanding of vision-language models (VLMs) for autonomous driving. The benchmark features 43 diverse scenarios spanning multiple views, resulting in 971 human-verified multiple-choice questions. A thorough evaluation uncovers shortcomings in existing models' ability to reason about fundamental traffic dynamics in complex environments.
arXiv Detail & Related papers (2025-06-06T16:25:22Z) - MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation [56.87891213797931]
We present MTR-Bench for Large Language Models' multi-turn reasoning evaluation. Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities. MTR-Bench features a fully automated framework spanning both dataset construction and model evaluation.
arXiv Detail & Related papers (2025-05-21T17:59:12Z) - A Framework for a Capability-driven Evaluation of Scenario Understanding for Multimodal Large Language Models in Autonomous Driving [15.24721920935653]
Multimodal large language models (MLLMs) hold the potential to enhance autonomous driving. Their integration into autonomous driving systems shows promising results in isolated proof-of-concept applications. This paper proposes a holistic framework for a capability-driven evaluation of MLLMs in autonomous driving.
arXiv Detail & Related papers (2025-03-14T13:43:26Z) - Generating Out-Of-Distribution Scenarios Using Language Models [58.47597351184034]
Large Language Models (LLMs) have shown promise in autonomous driving.
This paper introduces a framework for generating diverse Out-Of-Distribution (OOD) driving scenarios.
We evaluate our framework through extensive simulations and introduce a new "OOD-ness" metric.
arXiv Detail & Related papers (2024-11-25T16:38:17Z) - Automatic benchmarking of large multimodal models via iterative experiment programming [71.78089106671581]
We present APEx, the first framework for automatic benchmarking of LMMs.
Given a research question expressed in natural language, APEx leverages a large language model (LLM) and a library of pre-specified tools to generate a set of experiments for the model at hand.
The report drives the testing procedure: based on the current status of the investigation, APEx chooses which experiments to perform and whether the results are sufficient to draw conclusions.
arXiv Detail & Related papers (2024-06-18T06:43:46Z) - DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving [69.82743399946371]
DriveMLM is a framework that can perform closed-loop autonomous driving in realistic simulators.
We employ a multi-modal LLM (MLLM) to model the behavior planning module of a modular AD system.
This model can be plugged into existing AD systems, such as Apollo, for closed-loop driving.
arXiv Detail & Related papers (2023-12-14T18:59:05Z) - Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving [38.28159034562901]
Reason2Drive is a benchmark dataset with over 600K video-text pairs.
We characterize the autonomous driving process as a sequential combination of perception, prediction, and reasoning steps.
We introduce a novel aggregated evaluation metric to assess chain-based reasoning performance in autonomous systems.
arXiv Detail & Related papers (2023-12-06T18:32:33Z) - LLM4Drive: A Survey of Large Language Models for Autonomous Driving [62.10344445241105]
Large language models (LLMs) have demonstrated abilities including understanding context, logical reasoning, and generating answers.
In this paper, we systematically review a line of research on Large Language Models for Autonomous Driving (LLM4AD).
arXiv Detail & Related papers (2023-11-02T07:23:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.