Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety
- URL: http://arxiv.org/abs/2504.13399v1
- Date: Fri, 18 Apr 2025 01:25:02 GMT
- Title: Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety
- Authors: Shashank Shriram, Srinivasa Perisetla, Aryan Keskar, Harsha Krishnaswamy, Tonko Emil Westerhof Bossen, Andreas Møgelmose, Ross Greer
- Abstract summary: We propose a multimodal approach that integrates vision-language reasoning with zero-shot object detection. We refine object detection by incorporating OpenAI's CLIP model to match predicted hazards with bounding box annotations. Our findings highlight the strengths and limitations of current vision-language-based approaches.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting anomalous hazards in visual data, particularly in video streams, is a critical challenge in autonomous driving. Existing models often struggle with unpredictable, out-of-label hazards due to their reliance on predefined object categories. In this paper, we propose a multimodal approach that integrates vision-language reasoning with zero-shot object detection to improve hazard identification and explanation. Our pipeline combines a Vision-Language Model (VLM) with a Large Language Model (LLM) to detect hazardous objects within a traffic scene. We refine object detection by incorporating OpenAI's CLIP model to match predicted hazards with bounding box annotations, improving localization accuracy. To assess model performance, we create a ground truth dataset by denoising and extending the foundational COOOL (Challenge-of-Out-of-Label) anomaly detection benchmark dataset with complete natural language descriptions for hazard annotations. We define a means of evaluating hazard detection and labeling on the extended dataset using cosine similarity. This evaluation considers the semantic similarity between the predicted hazard description and the annotated ground truth for each video. Additionally, we release a set of tools for structuring and managing large-scale hazard detection datasets. Our findings highlight the strengths and limitations of current vision-language-based approaches, offering insights into future improvements in autonomous hazard detection systems. Our models, scripts, and data can be found at https://github.com/mi3labucm/COOOLER.git
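The abstract describes two concrete steps: matching predicted hazard descriptions to bounding-box annotations with CLIP, and scoring predicted descriptions against ground-truth annotations via cosine similarity. The sketch below illustrates both under stated assumptions; the checkpoint name (openai/clip-vit-base-patch32), the (x1, y1, x2, y2) box format, and the use of CLIP's text encoder for the description-similarity metric are illustrative choices, not details confirmed by the paper.

```python
# Minimal sketch of CLIP-based hazard-to-box matching and description scoring.
# Assumptions (not from the paper): checkpoint, box tuple format, CLIP text space
# as the embedding space for the cosine-similarity evaluation.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def match_hazard_to_boxes(frame: Image.Image, boxes, hazard_text: str):
    """Return the annotated box whose crop best matches the hazard description.

    `boxes` is a list of (x1, y1, x2, y2) tuples for one video frame.
    """
    crops = [frame.crop(b) for b in boxes]
    inputs = proc(text=[hazard_text], images=crops,
                  return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])
    sims = F.cosine_similarity(img_emb, txt_emb, dim=-1)  # one score per crop
    best = int(sims.argmax())
    return boxes[best], sims[best].item()


def description_similarity(predicted: str, ground_truth: str) -> float:
    """Cosine similarity between predicted and annotated hazard descriptions
    (computed here in CLIP's text space; any sentence encoder could be swapped in)."""
    inputs = proc(text=[predicted, ground_truth],
                  return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
    return F.cosine_similarity(emb[0:1], emb[1:2], dim=-1).item()
```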
Related papers
- Addressing Out-of-Label Hazard Detection in Dashcam Videos: Insights from the COOOL Challenge [0.0]
This paper presents a novel approach for hazard analysis in dashcam footage. It addresses the detection of driver reactions to hazards, the identification of hazardous objects, and the generation of descriptive captions. Our method achieved the highest scores in the Challenge on Out-of-Label in Autonomous Driving.
arXiv Detail & Related papers (2025-01-27T13:32:01Z) - FuzzRisk: Online Collision Risk Estimation for Autonomous Vehicles based on Depth-Aware Object Detection via Fuzzy Inference [6.856508678236828]
The framework takes two sets of predictions from different algorithms and associates their inconsistencies with the collision risk via fuzzy inference. We experimentally validate that, based on Intersection-over-Union (IoU) and a depth discrepancy measure, the inconsistencies between the two sets of predictions strongly correlate to the error of the 3D object detector. We optimize the fuzzy inference system towards an existing offline metric that matches AV collision rates well.
arXiv Detail & Related papers (2024-11-09T20:20:36Z) - Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance.
Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z) - CableInspect-AD: An Expert-Annotated Anomaly Detection Dataset [14.246172794156987]
CableInspect-AD is a high-quality dataset created and annotated by domain experts from Hydro-Québec, a Canadian public utility.
This dataset includes high-resolution images with challenging real-world anomalies, covering defects with varying severity levels.
We present a comprehensive evaluation protocol based on cross-validation to assess models' performances.
arXiv Detail & Related papers (2024-09-30T14:50:13Z) - CRASH: Crash Recognition and Anticipation System Harnessing with Context-Aware and Temporal Focus Attentions [13.981748780317329]
Accurately and promptly predicting accidents among surrounding traffic agents from camera footage is crucial for the safety of autonomous vehicles (AVs).
This study introduces a novel accident anticipation framework for AVs, termed CRASH.
It seamlessly integrates five components: object detector, feature extractor, object-aware module, context-aware module, and multi-layer fusion.
Our model surpasses existing top baselines in critical evaluation metrics like Average Precision (AP) and mean Time-To-Accident (mTTA).
arXiv Detail & Related papers (2024-07-25T04:12:49Z) - Bayesian Detector Combination for Object Detection with Crowdsourced Annotations [49.43709660948812]
Acquiring fine-grained object detection annotations in unconstrained images is time-consuming, expensive, and prone to noise.
We propose a novel Bayesian Detector Combination (BDC) framework to more effectively train object detectors with noisy crowdsourced annotations.
BDC is model-agnostic, requires no prior knowledge of the annotators' skill level, and seamlessly integrates with existing object detection models.
arXiv Detail & Related papers (2024-07-10T18:00:54Z) - Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - GLENet: Boosting 3D Object Detectors with Generative Label Uncertainty Estimation [70.75100533512021]
In this paper, we formulate the label uncertainty problem as the diversity of potentially plausible bounding boxes of objects.
We propose GLENet, a generative framework adapted from conditional variational autoencoders, to model the one-to-many relationship between a typical 3D object and its potential ground-truth bounding boxes with latent variables.
The label uncertainty generated by GLENet is a plug-and-play module and can be conveniently integrated into existing deep 3D detectors.
arXiv Detail & Related papers (2022-07-06T06:26:17Z) - Exploiting Multi-Object Relationships for Detecting Adversarial Attacks in Complex Scenes [51.65308857232767]
Vision systems that deploy Deep Neural Networks (DNNs) are known to be vulnerable to adversarial examples.
Recent research has shown that checking the intrinsic consistencies in the input data is a promising way to detect adversarial attacks.
We develop a novel approach to perform context consistency checks using language models.
arXiv Detail & Related papers (2021-08-19T00:52:10Z)