AnomalyControl: Learning Cross-modal Semantic Features for Controllable Anomaly Synthesis
- URL: http://arxiv.org/abs/2412.06510v4
- Date: Thu, 07 Aug 2025 03:13:02 GMT
- Title: AnomalyControl: Learning Cross-modal Semantic Features for Controllable Anomaly Synthesis
- Authors: Shidan He, Lei Liu, Xiujun Shu, Bo Wang, Yuanhao Feng, Shen Zhao
- Abstract summary: We propose a novel anomaly synthesis framework called AnomalyControl to learn cross-modal semantic features as guidance signals. AnomalyControl achieves state-of-the-art results in anomaly synthesis compared with existing methods.
- Score: 9.659449396370023
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Anomaly synthesis is a crucial approach to augment abnormal data for advancing anomaly inspection. Based on the knowledge from the large-scale pre-training, existing text-to-image anomaly synthesis methods predominantly focus on textual information or coarse-aligned visual features to guide the entire generation process. However, these methods often lack sufficient descriptors to capture the complicated characteristics of realistic anomalies (e.g., the fine-grained visual pattern of anomalies), limiting the realism and generalization of the generation process. To this end, we propose a novel anomaly synthesis framework called AnomalyControl to learn cross-modal semantic features as guidance signals, which could encode the generalized anomaly cues from text-image reference prompts and improve the realism of synthesized abnormal samples. Specifically, AnomalyControl adopts a flexible and non-matching prompt pair (i.e., a text-image reference prompt and a targeted text prompt), where a Cross-modal Semantic Modeling (CSM) module is designed to extract cross-modal semantic features from the textual and visual descriptors. Then, an Anomaly-Semantic Enhanced Attention (ASEA) mechanism is formulated to allow CSM to focus on the specific visual patterns of the anomaly, thus enhancing the realism and contextual relevance of the generated anomaly features. Treating cross-modal semantic features as the prior, a Semantic Guided Adapter (SGA) is designed to encode effective guidance signals for the adequate and controllable synthesis process. Extensive experiments indicate that AnomalyControl can achieve state-of-the-art results in anomaly synthesis compared with existing methods while exhibiting superior performance for downstream tasks.
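To make the guidance path described in the abstract concrete, below is a minimal PyTorch sketch of the three named components (CSM, ASEA, SGA). It is not the authors' implementation: the random tensors standing in for frozen text/image encoders, all dimensions, the sigmoid gating used for ASEA, and the residual way the adapter feeds a single denoiser feature map are illustrative assumptions, included only to show how the cross-modal semantic features flow from the text-image reference prompt into the synthesis process.

```python
# Minimal sketch of the AnomalyControl guidance path (CSM -> ASEA -> SGA).
# NOT the authors' code: module names follow the abstract, but dimensions,
# the toy stand-in encoders, and the adapter injection are assumptions.
import torch
import torch.nn as nn


class CrossModalSemanticModeling(nn.Module):
    """CSM: fuse textual and visual descriptors of the text-image reference prompt."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_patches):
        # Text tokens query the reference-image patches to pick up the
        # fine-grained visual pattern of the anomaly.
        fused, attn = self.cross_attn(text_tokens, image_patches, image_patches)
        return self.norm(text_tokens + fused), attn


class AnomalySemanticEnhancedAttention(nn.Module):
    """ASEA: re-weight fused features toward anomaly-specific visual patterns."""
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-token saliency for the anomaly

    def forward(self, fused_tokens):
        weights = torch.sigmoid(self.score(fused_tokens))   # (B, T, 1)
        return fused_tokens * weights                        # emphasize anomaly cues


class SemanticGuidedAdapter(nn.Module):
    """SGA: turn cross-modal semantic features into residual guidance for a denoiser block."""
    def __init__(self, dim=256, unet_channels=320):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, unet_channels), nn.SiLU(),
                                  nn.Linear(unet_channels, unet_channels))

    def forward(self, unet_features, guidance_tokens):
        # Pool the guidance tokens and add them as a per-channel residual.
        g = self.proj(guidance_tokens.mean(dim=1))            # (B, C)
        return unet_features + g[:, :, None, None]            # (B, C, H, W)


if __name__ == "__main__":
    B, T_txt, T_img, dim = 2, 16, 64, 256
    text_tokens = torch.randn(B, T_txt, dim)    # stand-in for a frozen text encoder
    image_patches = torch.randn(B, T_img, dim)  # stand-in for a frozen image encoder
    unet_feat = torch.randn(B, 320, 32, 32)     # stand-in for one denoiser feature map

    csm = CrossModalSemanticModeling(dim)
    asea = AnomalySemanticEnhancedAttention(dim)
    sga = SemanticGuidedAdapter(dim)

    fused, _ = csm(text_tokens, image_patches)
    guided = asea(fused)
    out = sga(unet_feat, guided)
    print(out.shape)  # torch.Size([2, 320, 32, 32])
```

In the actual framework, the textual and visual descriptors would come from pretrained encoders applied to the non-matching prompt pair, and the adapter would presumably inject guidance at multiple layers of a text-to-image diffusion backbone; the sketch above only illustrates the data flow between the three modules.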
Related papers
- TranX-Adapter: Bridging Artifacts and Semantics within MLLMs for Robust AI-generated Image Detection [70.42796551833946]
Incorporating texture-level artifact features alongside semantic features into multimodal large language models (MLLMs) can enhance their AIGI detection capability. We propose a lightweight fusion adapter, TranX-Adapter, which integrates a Task-aware Optimal-Transport Fusion. Experiments on standard AIGI detection benchmarks with several advanced MLLMs show that TranX-Adapter brings consistent and significant improvements.
arXiv Detail & Related papers (2026-02-25T09:22:46Z) - CtrlFuse: Mask-Prompt Guided Controllable Infrared and Visible Image Fusion [51.060328159429154]
Infrared and visible image fusion generates all-weather perception-capable images by combining complementary modalities. We propose CtrlFuse, a controllable image fusion framework that enables interactive dynamic fusion guided by mask prompts. Experiments demonstrate state-of-the-art results in both fusion controllability and segmentation accuracy, with the adapted task branch even outperforming the original segmentation model.
arXiv Detail & Related papers (2026-01-12T13:36:48Z) - ART-ASyn: Anatomy-aware Realistic Texture-based Anomaly Synthesis Framework for Chest X-Rays [10.059919773758562]
Unsupervised anomaly detection aims to identify anomalies without pixel-level annotations. This paper presents a novel Anatomy-aware Realistic Texture-based Anomaly Synthesis framework (ART-ASyn) for chest X-rays. ART-ASyn generates realistic and anatomically consistent lung-opacity-related anomalies using texture-based augmentation.
arXiv Detail & Related papers (2025-11-29T04:06:58Z) - Semantic Visual Anomaly Detection and Reasoning in AI-Generated Images [96.43608872116347]
AnomReason is a large-scale benchmark with structured quadruple annotations, introduced together with AnomAgent. AnomReason and AnomAgent serve as a foundation for measuring and improving the semantic plausibility of AI-generated images.
arXiv Detail & Related papers (2025-10-11T14:09:24Z) - Beyond Human-prompting: Adaptive Prompt Tuning with Semantic Alignment for Anomaly Detection [20.650740481670276]
We propose Adaptive Prompt Tuning with semantic alignment for anomaly detection (APT). APT uses self-generated anomaly samples with noise perturbations to train learnable prompts that capture context-dependent anomalies in different scenarios. Our system achieves state-of-the-art performance on multiple benchmark datasets without requiring prior knowledge for prompt crafting.
arXiv Detail & Related papers (2025-08-22T07:26:56Z) - Quality-Aware Language-Conditioned Local Auto-Regressive Anomaly Synthesis and Detection [30.77558600436759]
ARAS is a language-conditioned, auto-regressive anomaly synthesis approach. It injects local, text-specified defects into normal images via token-anchored latent editing. ARAS significantly enhances defect realism, preserves fine-grained material textures, and provides continuous semantic control over synthesized anomalies.
arXiv Detail & Related papers (2025-08-05T15:07:32Z) - Generate Aligned Anomaly: Region-Guided Few-Shot Anomaly Image-Mask Pair Synthesis for Industrial Inspection [53.137651284042434]
Anomaly inspection plays a vital role in industrial manufacturing, but the scarcity of anomaly samples limits the effectiveness of existing methods. We propose Generate Aligned Anomaly (GAA), a region-guided, few-shot anomaly image-mask pair generation framework. GAA generates realistic, diverse, and semantically aligned anomalies using only a small number of samples.
arXiv Detail & Related papers (2025-07-13T12:56:59Z) - Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing [71.29488677105127]
Existing scene text recognition (STR) methods struggle to recognize challenging texts, especially for artistic and severely distorted characters.
We propose a contrastive learning-based STR framework by leveraging synthetic and real unlabeled data without any human cost.
Our method achieves SOTA performance (94.7% and 70.9% average accuracy on common benchmarks and Union14M-Benchmark, respectively).
arXiv Detail & Related papers (2024-11-23T15:24:47Z) - Reconciling Semantic Controllability and Diversity for Remote Sensing Image Synthesis with Hybrid Semantic Embedding [12.330893658398042]
We present a Hybrid Semantic Embedding Guided Generative Adversarial Network (HySEGGAN) for controllable and efficient remote sensing image synthesis.
Motivated by feature description, we propose a hybrid semantic embedding method that coordinates fine-grained local semantic layouts.
A Semantic Refinement Network (SRN) is introduced, incorporating a novel loss function to ensure fine-grained semantic feedback.
arXiv Detail & Related papers (2024-11-22T07:51:36Z) - Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection [109.72772150095646]
FAPrompt is a novel framework designed to learn Fine-grained Abnormality Prompts for accurate ZSAD. Experiments on 19 real-world datasets, covering both industrial defects and medical anomalies, demonstrate that FAPrompt substantially outperforms state-of-the-art methods in both image- and pixel-level ZSAD tasks.
arXiv Detail & Related papers (2024-10-14T08:41:31Z) - MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection [64.29452783056253]
The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia.
Although existing approaches mainly capture face forgery patterns using image modality, other modalities like fine-grained noises and texts are not fully explored.
We propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities.
arXiv Detail & Related papers (2024-09-15T13:08:59Z) - Human-Free Automated Prompting for Vision-Language Anomaly Detection: Prompt Optimization with Meta-guiding Prompt Scheme [19.732769780675977]
Pre-trained vision-language models (VLMs) are highly adaptable to various downstream tasks through few-shot learning.
Traditional methods depend on human-crafted prompts that require prior knowledge of specific anomaly types.
Our goal is to develop a human-free prompt-based anomaly detection framework that optimally learns prompts through data-driven methods.
arXiv Detail & Related papers (2024-06-26T09:29:05Z) - AnomalyXFusion: Multi-modal Anomaly Synthesis with Diffusion [31.338732251924103]
Anomaly synthesis is an effective method to augment abnormal samples for training.
We present the AnomalyXFusion framework, designed to harness multi-modality information to enhance the quality of synthesized abnormal samples.
arXiv Detail & Related papers (2024-04-30T10:48:43Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Updated version: A Video Anomaly Detection Framework based on
Appearance-Motion Semantics Representation Consistency [2.395616571632115]
We propose a framework of Appearance-Motion Semantics Consistency Representation.
The two-stream structure is designed to encode the appearance and motion representations of normal samples.
A novel consistency loss is proposed to enhance the consistency of feature semantics so that anomalies with low consistency can be identified.
arXiv Detail & Related papers (2023-03-09T08:28:34Z) - Siamese Transition Masked Autoencoders as Uniform Unsupervised Visual
Anomaly Detector [4.33060257697635]
This paper proposes a novel framework termed Siamese Transition Masked Autoencoders (ST-MAE) to handle various visual anomaly detection tasks uniformly.
Our deep feature transition scheme yields an unsupervised, semantic self-supervisory task to extract normal patterns.
arXiv Detail & Related papers (2022-11-01T09:45:49Z) - GSMFlow: Generation Shifts Mitigating Flow for Generalized Zero-Shot
Learning [55.79997930181418]
Generalized Zero-Shot Learning aims to recognize images from both the seen and unseen classes by transferring semantic knowledge from seen to unseen classes.
It is a promising solution to take advantage of generative models to hallucinate realistic unseen samples based on the knowledge learned from the seen classes.
We propose a novel flow-based generative framework that consists of multiple conditional affine coupling layers for learning unseen data generation.
arXiv Detail & Related papers (2022-07-05T04:04:37Z) - Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto Generative Adversarial Nets (GANs).
arXiv Detail & Related papers (2022-06-30T18:31:51Z) - A Video Anomaly Detection Framework based on Appearance-Motion Semantics
Representation Consistency [18.06814233420315]
We propose a framework that uses normal data's appearance and motion semantic representation consistency to handle anomaly detection.
We design a two-stream encoder to encode the appearance and motion information representations of normal samples.
Lower consistency of appearance and motion features of anomalous samples can be used to generate predicted frames with larger reconstruction error.
arXiv Detail & Related papers (2022-04-08T15:59:57Z) - Retrieval-based Spatially Adaptive Normalization for Semantic Image
Synthesis [68.1281982092765]
We propose a novel normalization module, termed REtrieval-based Spatially AdaptIve normaLization (RESAIL).
RESAIL provides pixel level fine-grained guidance to the normalization architecture.
Experiments on several challenging datasets show that RESAIL performs favorably against state-of-the-art methods in terms of quantitative metrics, visual quality, and subjective evaluation.
arXiv Detail & Related papers (2022-04-06T14:21:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.