SAVANT: Semantic Analysis with Vision-Augmented Anomaly deTection
- URL: http://arxiv.org/abs/2510.18034v1
- Date: Mon, 20 Oct 2025 19:14:29 GMT
- Title: SAVANT: Semantic Analysis with Vision-Augmented Anomaly deTection
- Authors: Roberto Brusnicki, David Pop, Yuan Gao, Mattia Piccinini, Johannes Betz
- Abstract summary: SAVANT is a structured reasoning framework that achieves high accuracy and recall in detecting anomalous driving scenarios. By automatically labeling over 9,640 real-world images with high accuracy, SAVANT addresses the critical data scarcity problem in anomaly detection.
- Score: 6.806105013817923
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous driving systems remain critically vulnerable to the long-tail of rare, out-of-distribution scenarios with semantic anomalies. While Vision Language Models (VLMs) offer promising reasoning capabilities, naive prompting approaches yield unreliable performance and depend on expensive proprietary models, limiting practical deployment. We introduce SAVANT (Semantic Analysis with Vision-Augmented Anomaly deTection), a structured reasoning framework that achieves high accuracy and recall in detecting anomalous driving scenarios from input images through layered scene analysis and a two-phase pipeline: structured scene description extraction followed by multi-modal evaluation. Our approach transforms VLM reasoning from ad-hoc prompting to systematic analysis across four semantic layers: Street, Infrastructure, Movable Objects, and Environment. SAVANT achieves 89.6% recall and 88.0% accuracy on real-world driving scenarios, significantly outperforming unstructured baselines. More importantly, we demonstrate that our structured framework enables a fine-tuned 7B parameter open-source model (Qwen2.5VL) to achieve 90.8% recall and 93.8% accuracy - surpassing all models evaluated while enabling local deployment at near-zero cost. By automatically labeling over 9,640 real-world images with high accuracy, SAVANT addresses the critical data scarcity problem in anomaly detection and provides a practical path toward reliable, accessible semantic monitoring for autonomous systems.
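To make the two-phase pipeline concrete, here is a minimal sketch of how layered extraction followed by multi-modal evaluation could be wired up. The VLM client interface, prompt wording, and output parsing are illustrative assumptions; only the four layer names and the two phases come from the abstract.

```python
# Minimal sketch of SAVANT's two-phase pipeline as described in the abstract.
# The vlm.query interface and the prompts are assumptions, not the authors' code.
from dataclasses import dataclass, field

LAYERS = ["Street", "Infrastructure", "Movable Objects", "Environment"]

@dataclass
class SceneDescription:
    layers: dict = field(default_factory=dict)  # layer name -> textual description

def extract_scene_description(vlm, image) -> SceneDescription:
    """Phase 1: structured extraction, one query per semantic layer."""
    desc = SceneDescription()
    for layer in LAYERS:
        prompt = (
            f"Describe only the '{layer}' layer of this driving scene. "
            "List objects, states, and spatial relations as short bullet points."
        )
        desc.layers[layer] = vlm.query(image=image, prompt=prompt)
    return desc

def evaluate_anomaly(vlm, image, desc: SceneDescription) -> bool:
    """Phase 2: multi-modal evaluation over the image plus the structured description."""
    context = "\n".join(f"[{name}] {text}" for name, text in desc.layers.items())
    prompt = (
        "Given the layered scene description below and the image, decide whether "
        "the scenario contains a semantic anomaly. Answer 'anomalous' or 'normal'.\n"
        + context
    )
    return "anomalous" in vlm.query(image=image, prompt=prompt).lower()
```

Separating the phases this way means the evaluator sees both the raw image and a normalized textual scene, which is what lets the framework replace ad-hoc prompting with a fixed analysis structure.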
Related papers
- Edge-Optimized Vision-Language Models for Underground Infrastructure Assessment [1.5124107808802705]
This paper presents a novel two-stage pipeline for end-to-end summarization of underground infrastructure deficiencies. It combines our lightweight RAPID-SCAN segmentation model with a fine-tuned Vision-Language Model deployed on an edge computing platform. Our results show the potential of edge-deployable integrated AI systems to bridge the gap between automated defect detection and actionable insights for infrastructure maintenance.
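As a rough illustration of the two-stage idea, the flow below runs a segmentation model first and hands each flagged region to a VLM for description and summarization. The model objects, their methods, and the prompts are hypothetical placeholders, not the paper's API.

```python
def assess_infrastructure(seg_model, vlm, image):
    """Illustrative two-stage flow: segment deficiencies, then summarize with a VLM."""
    findings = []
    for region in seg_model.segment(image):          # stage 1: defect segmentation
        crop = image.crop(region.bbox)               # isolate the flagged area
        findings.append(vlm.query(image=crop, prompt="Briefly describe this defect."))
    report_prompt = (
        "Summarize these findings for a maintenance report:\n" + "\n".join(findings)
    )
    return vlm.query(prompt=report_prompt)           # stage 2: end-to-end summary
```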
arXiv Detail & Related papers (2026-02-03T17:03:46Z)
- VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension [51.76841625486355]
Referring Expression Comprehension (REC) aims to localize the image region corresponding to a natural-language query. Recent neuro-symbolic REC approaches leverage large language models (LLMs) and vision-language models (VLMs) to perform compositional reasoning. We introduce VIRO, a neuro-symbolic framework that embeds lightweight operator-level verifiers within reasoning steps.
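The operator-level verification idea can be pictured as pairing each symbolic operator in the reasoning program with a cheap check that runs before its output is passed downstream. The operator, verifier, and scene interfaces below are hypothetical placeholders, not VIRO's actual code.

```python
# Sketch of operator-level verification in a neuro-symbolic REC program.
def locate(scene, noun):
    """Return candidate regions whose detected label matches the noun."""
    return [r for r in scene.regions if r.label == noun]

def verify_locate(candidates, noun):
    """Lightweight check: at least one candidate, and all labels match the noun."""
    return bool(candidates) and all(r.label == noun for r in candidates)

def run_step(op, verifier, *args):
    """Execute one reasoning operator and gate its result on the paired verifier."""
    result = op(*args)
    if not verifier(result, *args[1:]):
        raise ValueError(f"verification failed at operator {op.__name__!r}")
    return result

# Usage: every step of the compositional program goes through run_step, so a
# failing operator is caught immediately instead of corrupting later steps:
# candidates = run_step(locate, verify_locate, scene, "dog")
```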
arXiv Detail & Related papers (2026-01-19T07:21:19Z)
- SSVP: Synergistic Semantic-Visual Prompting for Industrial Zero-Shot Anomaly Detection [55.54007781679915]
We propose Synergistic Semantic-Visual Prompting (SSVP), which efficiently fuses diverse visual encodings to elevate the model's fine-grained perception. SSVP achieves state-of-the-art performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, significantly outperforming existing zero-shot approaches.
arXiv Detail & Related papers (2026-01-14T04:42:19Z)
- NuRisk: A Visual Question Answering Dataset for Agent-Level Risk Assessment in Autonomous Driving [10.340969230365138]
Understanding risk in autonomous driving requires high-level reasoning about agent behavior and context. Current Vision Language Model (VLM)-based methods primarily ground agents in static images. We propose NuRisk as a critical benchmark for advancing spatio-temporal reasoning in autonomous driving.
arXiv Detail & Related papers (2025-09-30T08:37:31Z)
- OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z)
- Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding [10.242043337117005]
Vision Large Language Models (VLLMs) have demonstrated impressive capabilities in general visual tasks such as image captioning and visual question answering. However, their effectiveness in specialized, safety-critical domains like autonomous driving remains largely unexplored. We introduce DVBench, a pioneering benchmark designed to evaluate the performance of VLLMs in understanding safety-critical driving videos.
arXiv Detail & Related papers (2025-04-20T07:50:44Z)
- Resilience of Vision Transformers for Domain Generalisation in the Presence of Out-of-Distribution Noisy Images [2.2124795371148616]
We evaluate vision transformers pre-trained with masked image modelling (MIM) against synthetic out-of-distribution (OOD) benchmarks. Experiments demonstrate BEiT's robustness: it maintains 94% accuracy on PACS and 87% on Office-Home despite significant occlusions. These insights bridge the gap between lab-trained models and real-world deployment, offering a blueprint for building AI systems that generalise reliably under uncertainty.
arXiv Detail & Related papers (2025-04-05T16:25:34Z)
- DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding [76.3876070043663]
We propose DriveLMM-o1, a dataset and benchmark designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model.
arXiv Detail & Related papers (2025-03-13T17:59:01Z)
- Advancing Autonomous Vehicle Intelligence: Deep Learning and Multimodal LLM for Traffic Sign Recognition and Robust Lane Detection [11.743721109110792]
This paper introduces an integrated approach combining advanced deep learning techniques and Multimodal Large Language Models (MLLMs) for comprehensive road perception. For traffic sign recognition, we evaluate ResNet-50, YOLOv8, and RT-DETR, achieving state-of-the-art accuracies of 99.8% with ResNet-50, 98.0% with YOLOv8, and 96.6% with RT-DETR. For lane detection, we propose a CNN-based segmentation method enhanced by curve fitting, which delivers high accuracy under favorable conditions.
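The lane-detection summary pairs CNN segmentation with curve fitting; a common way to realize the fitting step is a low-order polynomial fit over the pixels of the predicted lane mask, sketched below. The mask source, the x = f(y) parameterization, and the quadratic order are assumptions, not details from the paper.

```python
import numpy as np

def fit_lane_curve(lane_mask: np.ndarray, order: int = 2) -> np.poly1d:
    """Fit x = f(y) through the pixels of a binary lane mask.

    Fitting x as a function of y suits near-vertical lanes in image space.
    The segmentation mask and quadratic order are illustrative choices.
    """
    ys, xs = np.nonzero(lane_mask)          # pixel coordinates of the lane class
    coeffs = np.polyfit(ys, xs, order)      # least-squares polynomial fit
    return np.poly1d(coeffs)

# Example: evaluate the fitted centerline at a few image rows.
mask = np.zeros((100, 100), dtype=bool)
rows = np.arange(100)
mask[rows, np.clip(50 + rows // 4, 0, 99)] = True   # synthetic slanted lane
curve = fit_lane_curve(mask)
print(curve([0, 50, 99]))                  # x-positions of the lane at those rows
```

np.polyfit solves the least-squares system directly, so the fit stays stable as long as the mask covers enough rows; wrapping it in RANSAC is a common hardening step for noisy masks.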
arXiv Detail & Related papers (2025-03-08T19:12:36Z)
- Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives [56.528835143531694]
We introduce DriveBench, a benchmark dataset designed to evaluate Vision-Language Models (VLMs). Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding. We propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding.
arXiv Detail & Related papers (2025-01-07T18:59:55Z)
- AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving [68.73885845181242]
We propose an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios.
We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
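The engine as summarized cycles through issue identification, data curation, auto-labeling, and verification. A schematic of that loop, with every stage stubbed as a hypothetical callable rather than AIDE's actual components:

```python
def data_engine_loop(model, pool, find_issues, curate, auto_label, verify, rounds=3):
    """Schematic automatic-data-engine loop; all stage functions are placeholders."""
    for _ in range(rounds):
        issues = find_issues(model)             # e.g. failure cases, rare classes
        batch = curate(pool, issues)            # pull relevant raw data from the pool
        labeled = auto_label(model, batch)      # pseudo-label with the current model
        model = model.finetune(labeled)         # improve the detector on new labels
        if not verify(model):                   # scenario-based regression check
            break
    return model
```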
arXiv Detail & Related papers (2024-03-26T04:27:56Z)
- ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods: semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)
- Scene-Graph Augmented Data-Driven Risk Assessment of Autonomous Vehicle Decisions [1.4086978333609153]
We propose a novel data-driven approach that uses scene-graphs as intermediate representations.
Our approach includes a Multi-Relation Graph Convolution Network, a Long Short-Term Memory network, and attention layers for modeling the subjective risk of driving maneuvers (sketched below).
We show that our approach achieves a higher classification accuracy than the state-of-the-art approach on both large (96.4% vs. 91.2%) and small (91.8% vs. 71.2%) synthesized datasets.
We also show that our model trained on a synthesized dataset achieves an average accuracy of 87.8% when tested on a real-world dataset.
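A minimal PyTorch sketch of the stack described above: relation-specific graph convolutions over each frame's scene-graph, an LSTM over the frame sequence, and attention pooling into a risk score. All dimensions, the node pooling, and the relation handling are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MRGCNLayer(nn.Module):
    """One multi-relation graph convolution: a separate weight per relation type."""
    def __init__(self, in_dim, out_dim, num_rel):
        super().__init__()
        self.rel_lins = nn.ModuleList(nn.Linear(in_dim, out_dim) for _ in range(num_rel))
        self.self_lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adjs):               # x: (N, in_dim); adjs: (num_rel, N, N)
        out = self.self_lin(x)
        for a, lin in zip(adjs, self.rel_lins):
            out = out + a @ lin(x)             # aggregate neighbors per relation
        return torch.relu(out)

class RiskModel(nn.Module):
    """Scene-graph sequence -> per-frame embedding -> LSTM -> attention -> risk."""
    def __init__(self, in_dim=16, hid=32, num_rel=3):
        super().__init__()
        self.gcn = MRGCNLayer(in_dim, hid, num_rel)
        self.lstm = nn.LSTM(hid, hid, batch_first=True)
        self.attn = nn.Linear(hid, 1)
        self.head = nn.Linear(hid, 1)

    def forward(self, frames):                 # frames: list of (x, adjs) per time step
        embs = [self.gcn(x, adjs).mean(dim=0) for x, adjs in frames]  # pool nodes
        seq, _ = self.lstm(torch.stack(embs).unsqueeze(0))            # (1, T, hid)
        w = torch.softmax(self.attn(seq), dim=1)                      # temporal attention
        pooled = (w * seq).sum(dim=1)
        return torch.sigmoid(self.head(pooled))                      # subjective risk
```

Here, frames would be built per time step by a scene-graph extractor, and a binary cross-entropy loss over the sigmoid output would train the risk classifier.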
arXiv Detail & Related papers (2020-08-31T07:41:27Z)