2nd Place Winning Solution for the CVPR2023 Visual Anomaly and Novelty
Detection Challenge: Multimodal Prompting for Data-centric Anomaly Detection
- URL: http://arxiv.org/abs/2306.09067v2
- Date: Tue, 5 Sep 2023 14:44:04 GMT
- Title: 2nd Place Winning Solution for the CVPR2023 Visual Anomaly and Novelty
Detection Challenge: Multimodal Prompting for Data-centric Anomaly Detection
- Authors: Yunkang Cao, Xiaohao Xu, Chen Sun, Yuqi Cheng, Liang Gao, Weiming Shen
- Abstract summary: This report introduces the winning solution of the team Segment Any Anomaly for the CVPR2023 Visual Anomaly and Novelty Detection (VAND) challenge.
We present a novel framework, i.e., Segment Any Anomaly + (SAA$+$), for zero-shot anomaly segmentation with multi-modal prompts.
We will release the code of our winning solution for the CVPR2023 VAND challenge.
- Score: 10.682758791557436
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: This technical report introduces the winning solution of the team Segment Any
Anomaly for the CVPR2023 Visual Anomaly and Novelty Detection (VAND) challenge.
Going beyond uni-modal prompts, e.g., language prompts, we present a novel
framework, i.e., Segment Any Anomaly + (SAA$+$), for zero-shot anomaly
segmentation with multi-modal prompts for the regularization of cascaded modern
foundation models. Inspired by the great zero-shot generalization ability of
foundation models like Segment Anything, we first explore their assembly (SAA)
to leverage diverse multi-modal prior knowledge for anomaly localization.
Subsequently, we further introduce multimodal prompts (SAA$+$) derived from
domain expert knowledge and target image context to enable the parameter-free
adaptation of foundation models to anomaly segmentation. The proposed SAA$+$
model achieves state-of-the-art performance on several anomaly segmentation
benchmarks, including VisA and MVTec-AD, in the zero-shot setting. We will
release the code of our winning solution for the CVPR2023 VAND challenge.
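A minimal sketch of the pipeline described above, not the authors' released implementation: candidate regions are produced by a text-promptable detector, refined into masks by a promptable segmenter, and then regularized by the multimodal prompts. Here `detect_boxes` and `segment_with_boxes` are hypothetical wrappers around such foundation models (e.g., a Grounding-DINO-like detector and a SAM-like segmenter), and the area threshold, region count, and saliency map are illustrative stand-ins for the domain-expert and image-context prompts; all names and defaults are assumptions for illustration only.

from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence
import numpy as np

@dataclass
class Candidate:
    mask: np.ndarray      # H x W boolean mask of the candidate region
    box: np.ndarray       # (x1, y1, x2, y2) bounding box
    confidence: float     # detector confidence for the language prompt

def saa_plus(
    image: np.ndarray,
    language_prompts: Sequence[str],                                   # e.g. ["defect", "anomaly"]
    detect_boxes: Callable[[np.ndarray, str], List[Candidate]],        # hypothetical detector wrapper
    segment_with_boxes: Callable[[np.ndarray, List[Candidate]], List[Candidate]],  # hypothetical segmenter wrapper
    max_region_area: float = 0.2,   # expert prompt: anomalies cover a small image fraction
    max_regions: int = 5,           # expert prompt: few anomalies per image
    saliency: Optional[np.ndarray] = None,  # image-context prompt, H x W in [0, 1]
) -> List[Candidate]:
    """Zero-shot anomaly segmentation with multimodal prompt regularization (sketch)."""
    # 1) Language prompts -> candidate boxes from a text-promptable detector.
    candidates: List[Candidate] = []
    for prompt in language_prompts:
        candidates.extend(detect_boxes(image, prompt))

    # 2) Boxes -> fine-grained masks from a promptable segmenter.
    candidates = segment_with_boxes(image, candidates)

    # 3) Expert-knowledge prompt: drop regions too large to be plausible anomalies.
    h, w = image.shape[:2]
    candidates = [c for c in candidates
                  if c.mask.sum() / float(h * w) <= max_region_area]

    # 4) Image-context prompt: re-score each region by how salient it is in this
    #    particular image, not only by how well it matches the text prompt.
    def score(c: Candidate) -> float:
        if saliency is None:
            return c.confidence
        return c.confidence * float(saliency[c.mask].mean())

    # 5) Expert prompt on the expected number of anomalies: keep only the top-k regions.
    return sorted(candidates, key=score, reverse=True)[:max_regions]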
Related papers
- First Place Solution to the ECCV 2024 BRAVO Challenge: Evaluating Robustness of Vision Foundation Models for Semantic Segmentation [1.8570591025615457]
We present the first place solution to the ECCV 2024 BRAVO Challenge.
A model is trained on Cityscapes and its robustness is evaluated on several out-of-distribution datasets.
This approach outperforms more complex existing approaches, and achieves first place in the challenge.
arXiv Detail & Related papers (2024-09-25T16:15:06Z)
- Sample-agnostic Adversarial Perturbation for Vision-Language Pre-training Models [7.350203999073509]
Recent studies on AI security have highlighted the vulnerability of Vision-Language Pre-training models to subtle yet intentionally designed perturbations in images and texts.
To the best of our knowledge, this is the first work to use multimodal decision boundaries to explore the creation of a universal, sample-agnostic perturbation that applies to any image.
arXiv Detail & Related papers (2024-08-06T06:25:39Z)
- Integrating Text and Image Pre-training for Multi-modal Algorithmic Reasoning [7.84845040922464]
We present our solution for the SMART-101 Challenge of the CVPR 2024 Multi-modal Algorithmic Reasoning Task.
Unlike traditional visual question answering tasks, this challenge evaluates the abstraction, deduction, and generalization abilities of neural networks.
Our model is based on two pre-trained models, dedicated to extracting features from text and images, respectively.
arXiv Detail & Related papers (2024-06-08T01:45:06Z)
- All in One Framework for Multimodal Re-identification in the Wild [58.380708329455466]
A multimodal learning paradigm for ReID is introduced, referred to as All-in-One (AIO).
AIO harnesses a frozen pre-trained big model as an encoder, enabling effective multimodal retrieval without additional fine-tuning.
Experiments on cross-modal and multimodal ReID reveal that AIO not only adeptly handles various modal data but also excels in challenging contexts.
arXiv Detail & Related papers (2024-05-08T01:04:36Z)
- Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering [48.7363941445826]
We propose an adaptive multi-agent system, named Multi-Agent VQA, to overcome the limitations of foundation models in object detection and counting.
We present preliminary experimental results under zero-shot scenarios and highlight some failure cases, offering new directions for future research.
arXiv Detail & Related papers (2024-03-21T18:57:25Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- Segment Any Anomaly without Training via Hybrid Prompt Regularization [15.38935129648466]
We present a novel framework, i.e., Segment Any Anomaly + (SAA+), for zero-shot anomaly segmentation with hybrid prompt regularization.
Our proposed SAA+ model achieves state-of-the-art performance on several anomaly segmentation benchmarks, including VisA, MVTec-AD, MTD, and KSDD2.
arXiv Detail & Related papers (2023-05-18T05:52:06Z)
- Exploring Multi-Modal Representations for Ambiguity Detection & Coreference Resolution in the SIMMC 2.0 Challenge [60.616313552585645]
We present models for effective Ambiguity Detection and Coreference Resolution in Conversational AI.
Specifically, we use TOD-BERT and LXMERT based models, compare them to a number of baselines and provide ablation experiments.
Our results show that (1) language models are able to exploit correlations in the data to detect ambiguity; and (2) unimodal coreference resolution models can avoid the need for a vision component.
arXiv Detail & Related papers (2022-02-25T12:10:02Z)
- Modality Completion via Gaussian Process Prior Variational Autoencoders for Multi-Modal Glioma Segmentation [75.58395328700821]
We propose a novel model, Multi-modal Gaussian Process Prior Variational Autoencoder (MGP-VAE), to impute one or more missing sub-modalities for a patient scan.
MGP-VAE leverages a Gaussian Process (GP) prior on the Variational Autoencoder (VAE) to exploit correlations across subjects/patients and sub-modalities.
We show the applicability of MGP-VAE on brain tumor segmentation where one, two, or three of the four sub-modalities may be missing.
arXiv Detail & Related papers (2021-07-07T19:06:34Z)