EMIT: Enhancing MLLMs for Industrial Anomaly Detection via Difficulty-Aware GRPO
- URL: http://arxiv.org/abs/2507.21619v1
- Date: Tue, 29 Jul 2025 09:18:22 GMT
- Title: EMIT: Enhancing MLLMs for Industrial Anomaly Detection via Difficulty-Aware GRPO
- Authors: Wei Guan, Jun Lan, Jian Cao, Hao Tan, Huijia Zhu, Weiqiang Wang
- Abstract summary: We propose EMIT, a unified framework that enhances multimodal large language models (MLLMs) for industrial anomaly detection (IAD). EMIT constructs a multi-task IAD dataset and utilizes GPT-generated object text descriptions to compensate for missing defective images. For few-shot anomaly detection, it integrates a soft prompt and heatmap-guided contrastive embeddings derived from patch-level comparisons. Experiments on the MMAD benchmark demonstrate that EMIT significantly enhances the IAD performance of MLLMs, achieving an average improvement of 7.77% over the base model.
- Score: 39.94790536636158
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Industrial anomaly detection (IAD) plays a crucial role in maintaining the safety and reliability of manufacturing systems. While multimodal large language models (MLLMs) show strong vision-language reasoning abilities, their effectiveness in IAD remains limited without domain-specific adaptation. In this work, we propose EMIT, a unified framework that enhances MLLMs for IAD via difficulty-aware group relative policy optimization (GRPO). EMIT constructs a multi-task IAD dataset and utilizes GPT-generated object text descriptions to compensate for missing defective images. For few-shot anomaly detection, it integrates a soft prompt and heatmap-guided contrastive embeddings derived from patch-level comparisons. To better handle difficult data samples, i.e., cases where the MLLM struggles to generate correct answers, we propose a difficulty-aware GRPO that extends the original GRPO by incorporating a response resampling strategy to ensure the inclusion of correct answers in the sampled responses, as well as an advantage reweighting mechanism to strengthen learning from such difficult data samples. Extensive experiments on the MMAD benchmark demonstrate that EMIT significantly enhances the IAD performance of MLLMs, achieving an average improvement of 7.77% over the base model (InternVL3-8B) across seven tasks.
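The difficulty-aware GRPO described in the abstract can be pictured with a short sketch. Standard GRPO scores each response in a sampled group relative to the group mean; the difficulty-aware variant adds (1) response resampling, so the group is redrawn until it contains at least one correct answer, and (2) advantage reweighting, so hard samples (those with few correct responses) contribute a stronger learning signal. The function names, the binary reward, and the linear reweighting rule below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def group_advantages(rewards):
    """Standard GRPO: normalize rewards within a sampled group of responses."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def difficulty_aware_advantages(sample_responses, is_correct,
                                group_size=8, max_resamples=4, reweight=2.0):
    """Sketch of the two difficulty-aware extensions to GRPO."""
    responses = sample_responses(group_size)
    rewards = [1.0 if is_correct(resp) else 0.0 for resp in responses]
    tries = 0
    # Response resampling: redraw the group until it contains a correct answer.
    while sum(rewards) == 0 and tries < max_resamples:
        responses = sample_responses(group_size)
        rewards = [1.0 if is_correct(resp) else 0.0 for resp in responses]
        tries += 1
    adv = group_advantages(rewards)
    # Advantage reweighting: fewer correct answers means a harder sample,
    # so scale up its learning signal (illustrative linear rule).
    difficulty = 1.0 - sum(rewards) / len(rewards)
    return adv * (1.0 + (reweight - 1.0) * difficulty), responses
```

In actual GRPO training these advantages would weight the policy-gradient loss over each response's tokens; the sketch only shows how the group statistics are formed.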
Related papers
- Group Relative Augmentation for Data Efficient Action Detection [11.169883977958454]
Adapting large Video-Language Models (VLMs) for action detection from only a few examples is challenging. We propose an efficient adaptation strategy combining parameter-efficient tuning (LoRA) with a novel learnable internal feature augmentation, and demonstrate the method's effectiveness on complex multi-label, multi-person action detection datasets (a minimal LoRA sketch follows this entry).
arXiv Detail & Related papers (2025-07-28T21:46:05Z)
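LoRA, the parameter-efficient tuning used in the entry above, freezes the pretrained weight matrix and learns only a low-rank additive update. A minimal PyTorch sketch (the rank and scaling values are arbitrary choices for illustration):

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # freeze the pretrained weight
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)    # update starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```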
- Heterogeneous Group-Based Reinforcement Learning for LLM-based Multi-Agent Systems [25.882461853973897]
We propose Multi-Agent Heterogeneous Group Policy Optimization (MHGPO), which guides policy updates by estimating relative reward advantages. MHGPO eliminates the need for critic networks, enhancing stability and reducing computational overhead. We also introduce three group rollout sampling strategies that trade off efficiency against effectiveness.
arXiv Detail & Related papers (2025-06-03T10:17:19Z)
- MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASR [59.83547898874152]
We introduce a sample-efficient, two-stage adaptation approach that integrates self-supervised learning with semi-supervised techniques (pseudo-labeling is sketched after this entry). MSDA is designed to enhance the robustness and generalization of ASR models, and we demonstrate that Meta PL can be applied effectively to ASR tasks, achieving state-of-the-art results.
arXiv Detail & Related papers (2025-05-30T14:46:05Z)
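Pseudo-labeling, the semi-supervised ingredient in MSDA above, keeps only the model's high-confidence predictions on unlabeled data as training targets. A generic sketch, assuming a plain classifier that returns logits (the 0.9 threshold is an illustrative choice, not MSDA's setting):

```python
import torch

def pseudo_label(model, unlabeled_batch, threshold=0.9):
    """Keep only predictions the model is confident about as pseudo-targets."""
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(unlabeled_batch), dim=-1)
        conf, labels = probs.max(dim=-1)
    mask = conf >= threshold                  # discard low-confidence items
    return unlabeled_batch[mask], labels[mask]
```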
- AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection [40.34270276536052]
Industrial Anomaly Detection (IAD) poses a formidable challenge due to the scarcity of defective samples. Traditional approaches, often constrained by hand-crafted features or domain-specific expert models, struggle to address this limitation. We introduce AnomalyR1, a pioneering framework that leverages VLM-R1, a Multimodal Large Language Model (MLLM) renowned for its exceptional generalization and interpretability.
arXiv Detail & Related papers (2025-04-16T09:48:41Z)
- An Empirical Study of Conformal Prediction in LLM with ASP Scaffolds for Robust Reasoning [52.29223403698673]
This paper examines the use of Conformal Language Modelling (CLM) alongside Answer Set Programming (ASP). We apply CLM to generate sets of ASP programs from an LLM, providing statistical guarantees on the correctness of the outputs (the underlying conformal calibration is sketched after this entry). Experimental results show that CLM significantly outperforms baseline models that use standard sampling methods.
arXiv Detail & Related papers (2025-03-07T14:10:10Z)
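Conformal prediction, the machinery underlying CLM above, calibrates a score threshold on held-out examples so that returned prediction sets contain a correct output with probability at least 1 − alpha. A minimal split-conformal sketch, with the nonconformity score function and candidate set left abstract:

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split conformal: quantile of calibration nonconformity scores chosen
    so that prediction sets achieve at least 1 - alpha coverage."""
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(np.asarray(cal_scores), q, method="higher")

def prediction_set(candidates, score_fn, threshold):
    """Keep every candidate whose nonconformity score falls below the threshold."""
    return [c for c in candidates if score_fn(c) <= threshold]
```

Here `cal_scores` would be the nonconformity scores of known-correct outputs on a calibration split; the coverage guarantee follows from the exchangeability of calibration and test points.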
- RAAD-LLM: Adaptive Anomaly Detection Using LLMs and RAG Integration [2.879328762187361]
We present RAAD-LLM, a novel framework for adaptive anomaly detection. By effectively utilizing domain-specific knowledge, RAAD-LLM enhances the detection of anomalies in time series data. Results show significant improvements over our previous model, with accuracy increasing from 70.7% to 88.6% on the real-world dataset.
arXiv Detail & Related papers (2025-03-04T17:20:43Z)
- Mitigating Forgetting in LLM Fine-Tuning via Low-Perplexity Token Learning [61.99353167168545]
We show that fine-tuning with LLM-generated data improves target-task performance and reduces non-target-task degradation. This is the first work to provide an empirical explanation, based on token perplexity reduction (sketched after this entry), for mitigating catastrophic forgetting in LLMs after fine-tuning.
arXiv Detail & Related papers (2025-01-24T08:18:56Z)
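Token perplexity, the quantity the entry above builds its explanation on, is the exponentiated negative log-likelihood of each token given its prefix. A sketch for a Hugging Face-style causal LM whose output exposes a `.logits` field (an assumption about the model interface):

```python
import torch
import torch.nn.functional as F

def token_perplexities(model, input_ids):
    """Per-token perplexity exp(-log p(token | prefix)) under a causal LM."""
    with torch.no_grad():
        logits = model(input_ids).logits            # (batch, seq, vocab)
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]                      # next-token targets
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return torch.exp(-token_ll)                     # low values = "easy" tokens
```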
- AgentPS: Agentic Process Supervision for Content Moderation with Multimodal LLMs [9.35901507816989]
We introduce AgentPS, a framework that integrates agentic process supervision into multimodal large language models. We show that AgentPS achieves substantial improvements over baseline MLLMs on public benchmarks and proprietary datasets. These results establish AgentPS as a scalable and effective solution for complex multimodal classification in large-scale industrial applications.
arXiv Detail & Related papers (2024-12-15T04:58:00Z)
- R-MTLLMF: Resilient Multi-Task Large Language Model Fusion at the Wireless Edge [78.26352952957909]
Multi-task large language models (MTLLMs) are important for many applications at the wireless edge, where users demand specialized models that handle multiple tasks efficiently. Model fusion via task vectors has emerged as an efficient approach for combining fine-tuning parameters to produce an MTLLM (a minimal fusion sketch follows this entry). This paper studies how edge users can collaboratively craft such MTLLMs via task vectors, under the assumption of worst-case adversarial attacks.
arXiv Detail & Related papers (2024-11-27T10:57:06Z)
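A task vector is the element-wise difference between fine-tuned and pretrained weights; fusion adds a scaled sum of such vectors back onto the base model. A minimal sketch over weight dictionaries (the 0.5 scale is an illustrative choice, and the paper's resilience mechanism against adversarial task vectors is not modeled):

```python
def task_vector(finetuned, base):
    """Task vector = fine-tuned weights minus pretrained weights."""
    return {k: finetuned[k] - base[k] for k in base}

def fuse(base, task_vectors, scale=0.5):
    """Multi-task model = base weights + scaled sum of task vectors."""
    fused = dict(base)
    for tv in task_vectors:
        for k, delta in tv.items():
            fused[k] = fused[k] + scale * delta
    return fused
```

The same arithmetic works on any mapping from parameter names to tensors, e.g., PyTorch `state_dict()` outputs.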
- mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA [78.45521005703958]
Multimodal Retrieval-Augmented Generation (mRAG) is naturally introduced to provide MLLMs with comprehensive and up-to-date knowledge. We propose a novel framework called multimodal Retrieval-Reflection-Augmented Generation (mR$^2$AG), which achieves adaptive retrieval and useful-information localization (a toy gating sketch follows this entry). mR$^2$AG significantly outperforms state-of-the-art MLLMs on INFOSEEK and Encyclopedic-VQA.
arXiv Detail & Related papers (2024-11-22T16:15:50Z)
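Adaptive retrieval, the behavior mR$^2$AG targets above, means deciding per query whether external knowledge is needed before generating. A toy gating sketch with callables standing in for the model, retriever, and decision rule (all names and the prompt format are hypothetical, not the paper's reflection mechanism):

```python
def answer(query, llm, retriever, needs_retrieval):
    """Adaptive RAG: only retrieve when the gate says the query needs it."""
    if needs_retrieval(query):
        passages = retriever(query, k=3)              # fetch external evidence
        context = "\n".join(passages)
        return llm(f"Context:\n{context}\n\nQuestion: {query}")
    return llm(f"Question: {query}")                  # rely on parametric knowledge
```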
- Adapting Large Multimodal Models to Distribution Shifts: The Role of In-Context Learning [41.59855801010565]
Large multimodal models (LMMs) can potentially act as general-purpose assistants and are highly robust against different distributions. Despite this, domain-specific adaptation is still necessary, particularly in specialized areas like healthcare. This work investigates in-context learning (ICL) as an effective alternative for enhancing LMMs' adaptability (a few-shot prompt sketch follows this entry).
arXiv Detail & Related papers (2024-05-20T17:59:21Z)
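In-context learning, the adaptation route studied in the last entry, requires no gradient updates: labeled demonstrations are simply placed ahead of the query in the prompt. A minimal sketch of assembling such a few-shot prompt (the format and the medical examples are illustrative):

```python
def few_shot_prompt(demos, query):
    """Build an ICL prompt from (input, label) demonstrations plus the query."""
    lines = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    lines.append(f"Input: {query}\nLabel:")           # model completes the label
    return "\n\n".join(lines)

# Example: adapting a general model to a specialized domain without fine-tuning.
prompt = few_shot_prompt(
    demos=[("chest X-ray, clear lung fields", "normal"),
           ("chest X-ray, right lower lobe opacity", "pneumonia")],
    query="chest X-ray, diffuse bilateral infiltrates",
)
```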