Improving Vision Anomaly Detection with the Guidance of Language
Modality
- URL: http://arxiv.org/abs/2310.02821v1
- Date: Wed, 4 Oct 2023 13:44:56 GMT
- Title: Improving Vision Anomaly Detection with the Guidance of Language
Modality
- Authors: Dong Chen, Kaihang Pan, Guoming Wang, Yueting Zhuang, Siliang Tang
- Abstract summary: This paper tackles the challenges for vision modality from a multimodal point of view.
We propose Cross-modal Guidance (CMG) to tackle the redundant information issue and sparse space issue.
To learn a more compact latent space for the vision anomaly detector, CMLE learns a correlation structure matrix from the language modality.
- Score: 64.53005837237754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have seen a surge of interest in anomaly detection for tackling
industrial defect detection, event detection, etc. However, existing
unsupervised anomaly detectors, particularly those for the vision modality,
face significant challenges due to redundant information and sparse latent
space. Conversely, the language modality performs well due to its relatively
single data. This paper tackles the aforementioned challenges for vision
modality from a multimodal point of view. Specifically, we propose Cross-modal
Guidance (CMG), which consists of Cross-modal Entropy Reduction (CMER) and
Cross-modal Linear Embedding (CMLE), to tackle the redundant information issue
and sparse space issue, respectively. CMER masks parts of the raw image and
computes the matching score with the text. Then, CMER discards irrelevant
pixels to make the detector focus on critical contents. To learn a more compact
latent space for the vision anomaly detector, CMLE learns a correlation
structure matrix from the language modality, and then the latent space of
vision modality will be learned with the guidance of the matrix. Thereafter,
the vision latent space will get semantically similar images closer. Extensive
experiments demonstrate the effectiveness of the proposed methods.
Particularly, CMG outperforms the baseline that only uses images by 16.81%.
Ablation experiments further confirm the synergy among the proposed methods, as
each component depends on the other to achieve optimal performance.
Related papers
- MODEL&CO: Exoplanet detection in angular differential imaging by learning across multiple observations [37.845442465099396]
Most post-processing methods build a model of the nuisances from the target observations themselves.
We propose to build the nuisance model from an archive of multiple observations by leveraging supervised deep learning techniques.
We apply the proposed algorithm to several datasets from the VLT/SPHERE instrument, and demonstrate a superior precision-recall trade-off.
arXiv Detail & Related papers (2024-09-23T09:22:45Z) - MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection [64.29452783056253]
The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia.
Although existing approaches mainly capture face forgery patterns using image modality, other modalities like fine-grained noises and texts are not fully explored.
We propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities.
arXiv Detail & Related papers (2024-09-15T13:08:59Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement
Learning [53.00683059396803]
Mask image model (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images.
We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for optimal image masking ratio and masking strategy.
Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
arXiv Detail & Related papers (2023-10-06T10:40:46Z) - Cross-Modal Causal Intervention for Medical Report Generation [109.83549148448469]
Medical report generation (MRG) is essential for computer-aided diagnosis and medication guidance.
Due to the spurious correlations within image-text data induced by visual and linguistic biases, it is challenging to generate accurate reports reliably describing lesion areas.
We propose a novel Visual-Linguistic Causal Intervention (VLCI) framework for MRG, which consists of a visual deconfounding module (VDM) and a linguistic deconfounding module (LDM)
arXiv Detail & Related papers (2023-03-16T07:23:55Z) - Superresolution and Segmentation of OCT scans using Multi-Stage
adversarial Guided Attention Training [18.056525121226862]
We propose the multi-stage & multi-discriminatory generative adversarial network (MultiSDGAN) to translate OCT scans in high-resolution segmentation labels.
We evaluate and compare various combinations of channel and spatial attention to the MultiSDGAN architecture to extract more powerful feature maps.
Our results demonstrate relative improvements of 21.44% and 19.45% on the Dice coefficient and SSIM, respectively.
arXiv Detail & Related papers (2022-06-10T00:26:55Z) - Multi-Perspective Anomaly Detection [3.3511723893430476]
We build upon the deep support vector data description algorithm and address multi-perspective anomaly detection.
We employ different augmentation techniques with a denoising process to deal with scarce one-class data.
We evaluate our approach on the new dices dataset using images from two different perspectives and also benchmark on the standard MNIST dataset.
arXiv Detail & Related papers (2021-05-20T17:07:36Z) - Unsupervised Anomaly Detection in MR Images using Multi-Contrast
Information [3.7273619690170796]
Anomaly detection in medical imaging is to distinguish the relevant biomarkers of diseases from those of normal tissues.
Deep supervised learning methods have shown potentials in various detection tasks, but its performances would be limited in medical imaging fields.
In this paper, we developed an unsupervised learning framework for pixel-wise anomaly detection in multi-contrast MRI.
arXiv Detail & Related papers (2021-05-02T13:05:36Z) - Self-supervised Equivariant Attention Mechanism for Weakly Supervised
Semantic Segmentation [93.83369981759996]
We propose a self-supervised equivariant attention mechanism (SEAM) to discover additional supervision and narrow the gap.
Our method is based on the observation that equivariance is an implicit constraint in fully supervised semantic segmentation.
We propose consistency regularization on predicted CAMs from various transformed images to provide self-supervision for network learning.
arXiv Detail & Related papers (2020-04-09T14:57:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.