AD-FM: Multimodal LLMs for Anomaly Detection via Multi-Stage Reasoning and Fine-Grained Reward Optimization
- URL: http://arxiv.org/abs/2508.04175v1
- Date: Wed, 06 Aug 2025 08:00:27 GMT
- Title: AD-FM: Multimodal LLMs for Anomaly Detection via Multi-Stage Reasoning and Fine-Grained Reward Optimization
- Authors: Jingyi Liao, Yongyi Su, Rong-Cheng Tu, Zhao Jin, Wenhao Sun, Yiting Li, Dacheng Tao, Xun Xu, Xulei Yang,
- Abstract summary: We propose a comprehensive framework addressing limitations through two synergistic innovations.<n>First, we introduce a multi-stage deliberative reasoning process that guides models from region identification to focused examination.<n>Second, we develop a fine-grained reward mechanism incorporating classification accuracy and localization supervision.
- Score: 43.86757207244911
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities across diverse domains, their application to specialized anomaly detection (AD) remains constrained by domain adaptation challenges. Existing Group Relative Policy Optimization (GRPO) based approaches suffer from two critical limitations: inadequate training data utilization when models produce uniform responses, and insufficient supervision over reasoning processes that encourage immediate binary decisions without deliberative analysis. We propose a comprehensive framework addressing these limitations through two synergistic innovations. First, we introduce a multi-stage deliberative reasoning process that guides models from region identification to focused examination, generating diverse response patterns essential for GRPO optimization while enabling structured supervision over analytical workflows. Second, we develop a fine-grained reward mechanism incorporating classification accuracy and localization supervision, transforming binary feedback into continuous signals that distinguish genuine analytical insight from spurious correctness. Comprehensive evaluation across multiple industrial datasets demonstrates substantial performance improvements in adapting general vision-language models to specialized anomaly detection. Our method achieves superior accuracy with efficient adaptation of existing annotations, effectively bridging the gap between general-purpose MLLM capabilities and the fine-grained visual discrimination required for detecting subtle manufacturing defects and structural irregularities.
Related papers
- SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment [12.388954043805235]
Vision-Language Models (VLMs) often struggle in industrial anomaly detection and reasoning.<n>SAGE is a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO)<n>SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings.
arXiv Detail & Related papers (2025-07-10T17:23:42Z) - CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection [54.85000884785013]
Anomaly detection is a complex problem due to the ambiguity in defining anomalies, the diversity of anomaly types, and the scarcity of training data.<n>We propose CLIPfusion, a method that leverages both discriminative and generative foundation models.<n>We believe that our method underscores the effectiveness of multi-modal and multi-model fusion in tackling the multifaceted challenges of anomaly detection.
arXiv Detail & Related papers (2025-06-13T13:30:15Z) - Reasoning Meets Personalization: Unleashing the Potential of Large Reasoning Model for Personalized Generation [21.89080753903469]
We present the first systematic evaluation of large reasoning models (LRMs) for personalization tasks.<n>Our analysis identifies three key limitations: divergent thinking, misalignment of response formats, and ineffective use of retrieved information.<n>We propose Reinforced Reasoning for Personalization (model), a novel framework that incorporates a hierarchical reasoning thought template to guide LRMs in generating structured outputs.
arXiv Detail & Related papers (2025-05-23T07:30:13Z) - AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection [40.34270276536052]
Industrial Anomaly Detection (IAD) poses a formidable challenge due to the scarcity of defective samples.<n>Traditional approaches, often constrained by hand-crafted features or domain-specific expert models, struggle to address this limitation.<n>We introduce AnomalyR1, a pioneering framework that leverages VLM-R1, a Multimodal Large Language Model (MLLM) renowned for its exceptional generalization and interpretability.
arXiv Detail & Related papers (2025-04-16T09:48:41Z) - Robust Distribution Alignment for Industrial Anomaly Detection under Distribution Shift [51.24522135151649]
Anomaly detection plays a crucial role in quality control for industrial applications.<n>Existing methods attempt to address domain shifts by training generalizable models.<n>Our proposed method demonstrates superior results compared with state-of-the-art anomaly detection and domain adaptation methods.
arXiv Detail & Related papers (2025-03-19T05:25:52Z) - EIAD: Explainable Industrial Anomaly Detection Via Multi-Modal Large Language Models [23.898938659720503]
Industrial Anomaly Detection (IAD) is critical to ensure product quality during manufacturing.<n>We propose a novel approach that introduces a dedicated multi-modal defect localization module to decouple the dialog functionality from the core feature extraction.<n>We also contribute to the first multi-modal industrial anomaly detection training dataset, named Defect Detection Question Answering (DDQA)
arXiv Detail & Related papers (2025-03-18T11:33:29Z) - RAAD-LLM: Adaptive Anomaly Detection Using LLMs and RAG Integration [2.879328762187361]
We present RAAD-LLM, a novel framework for adaptive anomaly detection.<n>By effectively utilizing domain-specific knowledge, RAAD-LLM enhances the detection of anomalies in time series data.<n>Results show significant improvements over our previous model with an accuracy increase from 70.7% to 88.6% on the real-world dataset.
arXiv Detail & Related papers (2025-03-04T17:20:43Z) - Progressive Multimodal Reasoning via Active Retrieval [64.74746997923967]
Multi-step multimodal reasoning tasks pose significant challenges for large language models (MLLMs)<n>We propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs.<n>We show that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
arXiv Detail & Related papers (2024-12-19T13:25:39Z) - Optimizing Multispectral Object Detection: A Bag of Tricks and Comprehensive Benchmarks [49.84182981950623]
Multispectral object detection, utilizing RGB and TIR (thermal infrared) modalities, is widely recognized as a challenging task.<n>It requires not only the effective extraction of features from both modalities and robust fusion strategies, but also the ability to address issues such as spectral discrepancies.<n>We introduce an efficient and easily deployable multispectral object detection framework that can seamlessly optimize high-performing single-modality models.
arXiv Detail & Related papers (2024-11-27T12:18:39Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.