Related papers: VisionGPT: LLM-Assisted Real-Time Anomaly Detection for Safe Visual Navigation

Related papers

IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection [70.02774285130238]
This paper explores the combination of rich text semantics with both image-level and pixel-level information from images.<n>We propose IAD-GPT, a novel paradigm based on MLLMs for Industrial Anomaly Detection.<n>Experiments on MVTec-AD and VisA datasets demonstrate our state-of-the-art performance.
arXiv Detail & Related papers (2025-10-16T02:48:05Z)
Object Detection with Multimodal Large Vision-Language Models: An In-depth Review [3.2882817259131403]
The fusion of language and vision in large vision-language models (LVLMs) has revolutionized deep learning-based object detection.<n>This in-depth review presents a structured exploration of the state-of-the-art in LVLMs.
arXiv Detail & Related papers (2025-08-25T17:21:00Z)
From Sight to Insight: Unleashing Eye-Tracking in Weakly Supervised Video Salient Object Detection [60.11169426478452]
This paper aims to introduce fixation information to assist the detection of salient objects under weak supervision.<n>We propose a Position and Semantic Embedding (PSE) module to provide location and semantic guidance during the feature learning process.<n>An Intra-Inter Mixed Contrastive (MCII) model improves thetemporal modeling capabilities under weak supervision.
arXiv Detail & Related papers (2025-06-30T05:01:40Z)
Transferable Adversarial Attacks on Black-Box Vision-Language Models [63.22532779621001]
adversarial attacks can transfer from open-source to proprietary black-box models in text-only and vision-only contexts.<n>We show that attackers can craft perturbations to induce specific attacker-chosen interpretations of visual information.<n>We discover that universal perturbations -- modifications applicable to a wide set of images -- can consistently induce these misinterpretations.
arXiv Detail & Related papers (2025-05-02T06:51:11Z)
AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
Unlocking the Capabilities of Vision-Language Models for Generalizable and Explainable Deepfake Detection [18.125287697902813]
Current vision-language models (VLMs) have demonstrated remarkable capabilities in understanding multimodal data, but their potential remains underexplored for deepfake detection. We present a novel paradigm that unlocksVLMs' potential capabilities through three components.
arXiv Detail & Related papers (2025-03-19T03:20:03Z)
Large Models in Dialogue for Active Perception and Anomaly Detection [35.16837804526144]
We propose a framework to actively collect information and perform anomaly detection in novel scenes.<n>Two deep learning models engage in a dialogue to actively control a drone to increase perception and anomaly detection accuracy.<n>In addition to information gathering, our approach is utilized for anomaly detection and our results demonstrate the proposed methods effectiveness.
arXiv Detail & Related papers (2025-01-27T18:38:36Z)
When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis [6.213279061986497]
SeeUnsafe is a framework that transforms video-based traffic accident analysis into a more interactive, conversational approach. Our framework employs a multimodal-based aggregation strategy to handle videos of various lengths and generate structured responses for review and evaluation. We conduct extensive experiments on the Toyota Woven Traffic Safety dataset, demonstrating that SeeUnsafe effectively performs accident-aware video classification and visual grounding.
arXiv Detail & Related papers (2025-01-17T23:35:34Z)
Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight [2.290956583394892]
Video anomaly detection (VAD) has witnessed significant advancements through the integration of large language models (LLMs) and vision-language models (VLMs) This paper presents an in-depth review of cutting-edge LLM-/VLM-based methods in 2024.
arXiv Detail & Related papers (2024-12-24T09:05:37Z)
Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent [8.212818176634116]
We extend the Llama-Adapter architecture by incorporating a YOLOS-based detection network alongside the CLIP perception network. Our approach introduces camera ID-separators to improve multi-view processing, crucial for comprehensive environmental awareness.
arXiv Detail & Related papers (2024-11-08T15:50:30Z)
Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines [18.602869210526848]
Vision Search Assistant is a novel framework that facilitates collaboration between vision-language models and web agents. By integrating visual and textual representations through this collaboration, the model can provide informed responses even when the image is novel to the system.
arXiv Detail & Related papers (2024-10-28T17:04:18Z)
VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use [74.39058448757645]
We present VipAct, an agent framework that enhances vision-language models (VLMs) VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks. We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements.
arXiv Detail & Related papers (2024-10-21T18:10:26Z)
Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models [72.75669790569629]
Vision-language alignment in Large Vision-Language Models (LVLMs) successfully enables LLMs to understand visual input. We find that existing vision-language alignment methods fail to transfer the existing safety mechanism for text in LLMs to vision. We propose a novel Text-Guided vision-language alignment method (TGA) for LVLMs.
arXiv Detail & Related papers (2024-10-16T15:20:08Z)
VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection [19.79027968793026]
Zero-shot anomaly detection (ZSAD) recognizes and localizes anomalies in previously unseen objects. Existing ZSAD methods are limited by closed-world settings, struggling to unseen defects with predefined prompts. We propose a novel framework VMAD (Visual-enhanced MLLM Anomaly Detection) that enhances MLLM with visual-based IAD knowledge and fine-grained perception.
arXiv Detail & Related papers (2024-09-30T09:51:29Z)
End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting [68.37943632270505]
Open-vocabulary video visual relationship detection aims to expand video visual relationship detection beyond categories. Existing methods usually use trajectory detectors trained on closed datasets to detect object trajectories. We propose an open-vocabulary relationship that leverages the rich semantic knowledge of CLIP to discover novel relationships.
arXiv Detail & Related papers (2024-09-19T06:25:01Z)
Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learnstemporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs) Our method achieves state-of-theart performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z)
InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding [12.082379948480257]
This paper proposes InsightSee, a multi-agent framework to enhance vision-language models' capabilities in handling complex visual understanding scenarios. The framework comprises a description agent, two reasoning agents, and a decision agent, which are integrated to refine the process of visual information interpretation. The proposed framework outperforms state-of-the-art algorithms in 6 out of 9 benchmark tests, with a substantial advancement in multimodal understanding.
arXiv Detail & Related papers (2024-05-31T13:56:55Z)
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest. This technique allows LVLMs to access more detailed visual information without altering the original image resolution. Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs [55.8550939439138]
Vision-Language Models (VLMs) have shown immense potential by integrating large language models with vision systems. These models face challenges in the fundamental computer vision task of object localisation, due to their training on multimodal data containing mostly captions. We introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing a minimal set of parameters that are slid inside the frozen VLM. Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads.
arXiv Detail & Related papers (2024-02-13T18:39:18Z)
HiLM-D: Enhancing MLLMs with Multi-Scale High-Resolution Details for Autonomous Driving [44.06475712570428]
HiLM-D is a resource-efficient framework that enhances visual information processing in MLLMs for ROLISP. Our method is motivated by the fact that the primary variations in autonomous driving scenarios are the motion trajectories. Experiments show HiLM-D's significant improvements over current MLLMs, with a 3.7% in BLEU-4 for captioning and 8.7% in mIoU for detection.
arXiv Detail & Related papers (2023-09-11T01:24:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.