Related papers: VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion

VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion

URL: http://arxiv.org/abs/2511.08173v1
Date: Wed, 12 Nov 2025 01:44:46 GMT
Title: VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion
Authors: Samet Hicsonmez, Abd El Rahman Shabayek, Djamila Aouada,
Abstract summary: ours is a novel unsupervised multi-class visual anomaly detection framework.<n>It integrates a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM) for enhanced anomaly localization and detection.<n>Our method achieves competitive performance, improving the pixel-level Per-Region-Overlap (PRO) metric by up to 25 points on the Real-IAD dataset and 8 points on the COCO-AD dataset.
Score: 15.486565360380203
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Detecting visual anomalies in diverse, multi-class real-world images is a significant challenge. We introduce \ours, a novel unsupervised multi-class visual anomaly detection framework. It integrates a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM) for enhanced anomaly localization and detection. Specifically, a pre-trained VLM with a simple prompt extracts detailed image descriptions, serving as additional conditioning for LDM training. Current diffusion-based methods rely on synthetic noise generation, limiting their generalization and requiring per-class model training, which hinders scalability. \ours, however, leverages VLMs to obtain normal captions without manual annotations or additional training. These descriptions condition the diffusion model, learning a robust normal image feature representation for multi-class anomaly detection. Our method achieves competitive performance, improving the pixel-level Per-Region-Overlap (PRO) metric by up to 25 points on the Real-IAD dataset and 8 points on the COCO-AD dataset, outperforming state-of-the-art diffusion-based approaches. Code is available at https://github.com/giddyyupp/VLMDiff.

Related papers

Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection [52.5174167737992]
Video anomaly detection (VAD) aims to identify abnormal events in videos.<n>We propose SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations.<n>Our method achieves state-of-the-art performance among tuning-free approaches requiring only 1% of training data.
arXiv Detail & Related papers (2026-02-27T13:48:50Z)
WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens [69.97021957331326]
We propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization.<n>We also introduce a VAE branch with linear projection to recover fine-grained image details.
arXiv Detail & Related papers (2025-12-02T09:02:20Z)
HiProbe-VAD: Video Anomaly Detection via Hidden States Probing in Tuning-Free Multimodal LLMs [8.18063726177317]
Video Anomaly Detection (VAD) aims to identify and locate deviations from normal patterns in video sequences.<n>We propose HiProbe-VAD, a novel framework that leverages pre-trained Multimodal Large Language Models (MLLMs) for VAD without requiring fine-tuning.
arXiv Detail & Related papers (2025-07-23T10:41:46Z)
LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling [38.700993166492495]
We propose a dataset-free, and unified approach through recurrent posterior sampling utilizing a pretrained latent diffusion model.<n>Our method incorporates the multimodal understanding model to provide sematic priors for the generative model under a task-blind condition.
arXiv Detail & Related papers (2025-07-01T14:25:09Z)
Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference.<n>It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps.<n>Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z)
Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation [54.96563068182733]
We propose Modality Adaptation with text-to-image Diffusion Models (MADM) for semantic segmentation task. MADM utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities. We show that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities.
arXiv Detail & Related papers (2024-10-29T03:49:40Z)
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling [64.09238330331195]
We propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework.<n>Unlike discretization line of method, MMAR takes in continuous-valued image tokens to avoid information loss in an efficient way.<n>We also propose a theoretically proven technique that addresses the numerical stability issue and a training strategy that balances the generation and understanding task goals.
arXiv Detail & Related papers (2024-10-14T17:57:18Z)
VL4AD: Vision-Language Models Improve Pixel-wise Anomaly Detection [5.66050466694651]
We propose Vision-Language (VL) encoders into existing anomaly detectors to leverage the semantically broad VL pre-training for improved outlier awareness. We also propose a new scoring function that enables data- and training-free outlier supervision via textual prompts. The resulting VL4AD model achieves competitive performance on widely used benchmark datasets.
arXiv Detail & Related papers (2024-09-25T20:12:10Z)
Harnessing Large Language Models for Training-free Video Anomaly Detection [34.76811491190446]
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Training-based methods are prone to be domain-specific, thus being costly for practical deployment. We propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm.
arXiv Detail & Related papers (2024-04-01T09:34:55Z)
FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [49.80911683739506]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.<n>We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.<n>Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images [68.42215385041114]
This paper introduces a novel lightweight multi-level adaptation and comparison framework to repurpose the CLIP model for medical anomaly detection. Our approach integrates multiple residual adapters into the pre-trained visual encoder, enabling a stepwise enhancement of visual features across different levels. Our experiments on medical anomaly detection benchmarks demonstrate that our method significantly surpasses current state-of-the-art models.
arXiv Detail & Related papers (2024-03-19T09:28:19Z)
Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text. These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining. We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.