Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly
Detection
- URL: http://arxiv.org/abs/2312.07495v1
- Date: Tue, 12 Dec 2023 18:28:59 GMT
- Title: Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly
Detection
- Authors: Jiangning Zhang, Xuhai Chen, Yabiao Wang, Chengjie Wang, Yong Liu,
Xiangtai Li, Ming-Hsuan Yang, Dacheng Tao
- Abstract summary: This work studies the recently proposed challenging and practical Multi-class Unsupervised Anomaly Detection (MUAD) task.
It only requires normal images for training while simultaneously testing both normal/anomaly images for multiple classes.
A plain Vision Transformer (ViT) with simple architecture has been shown effective in multiple domains.
- Score: 133.93365706990178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work studies the recently proposed challenging and practical Multi-class
Unsupervised Anomaly Detection (MUAD) task, which only requires normal images
for training while simultaneously testing both normal/anomaly images for
multiple classes. Existing reconstruction-based methods typically adopt pyramid
networks as encoders/decoders to obtain multi-resolution features, accompanied
by elaborate sub-modules with heavier handcraft engineering designs for more
precise localization. In contrast, a plain Vision Transformer (ViT) with simple
architecture has been shown effective in multiple domains, which is simpler,
more effective, and elegant. Following this spirit, this paper explores plain
ViT architecture for MUAD. Specifically, we abstract a Meta-AD concept by
inducing current reconstruction-based methods. Then, we instantiate a novel and
elegant plain ViT-based symmetric ViTAD structure, effectively designed step by
step from three macro and four micro perspectives. In addition, this paper
reveals several interesting findings for further exploration. Finally, we
propose a comprehensive and fair evaluation benchmark on eight metrics for the
MUAD task. Based on a naive training recipe, ViTAD achieves state-of-the-art
(SoTA) results and efficiency on the MVTec AD and VisA datasets without bells
and whistles, obtaining 85.4 mAD that surpasses SoTA UniAD by +3.0, and only
requiring 1.1 hours and 2.3G GPU memory to complete model training on a single
V100 GPU. Source code, models, and more results are available at
https://zhangzjn.github.io/projects/ViTAD.
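The abstract describes ViTAD only at a high level. As a rough illustration of the reconstruction idea it names (a plain ViT encoder, a symmetric plain ViT decoder, and an anomaly map from per-patch reconstruction error), the PyTorch sketch below may help. It is not the authors' released code: the module sizes, the cosine-distance scoring, and the tiny training objective are illustrative assumptions.
```python
# Illustrative sketch (not the authors' released ViTAD code) of plain-ViT,
# reconstruction-based anomaly detection: an encoder ViT extracts patch
# features, a symmetric decoder ViT reconstructs them, and the per-patch
# reconstruction error becomes a pixel-level anomaly map.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PlainViTRecon(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=384, depth=4, heads=6):
        super().__init__()
        self.grid = img_size // patch                        # patches per side
        self.patch_embed = nn.Conv2d(3, dim, patch, patch)   # patchify + project
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))
        make_layer = lambda: nn.TransformerEncoderLayer(
            dim, heads, dim * 4, batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(make_layer(), depth)  # plain ViT encoder
        self.decoder = nn.TransformerEncoder(make_layer(), depth)  # symmetric plain ViT decoder
        # In methods of this kind the encoder is usually a frozen pre-trained
        # ViT and only the decoder is optimized; both are trainable here for brevity.

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos
        enc = self.encoder(tokens)           # features of (normal) appearance
        dec = self.decoder(enc)              # reconstruction of those features
        return enc, dec

    @torch.no_grad()
    def anomaly_map(self, x):
        enc, dec = self.forward(x)
        err = 1 - F.cosine_similarity(enc, dec, dim=-1)      # (B, N): high = anomalous
        err = err.view(-1, 1, self.grid, self.grid)
        return F.interpolate(err, size=x.shape[-2:], mode="bilinear",
                             align_corners=False)


model = PlainViTRecon()
imgs = torch.randn(2, 3, 224, 224)                           # stand-in normal images
enc, dec = model(imgs)
loss = (1 - F.cosine_similarity(enc, dec, dim=-1)).mean()    # reconstruction objective
print(loss.item(), model.anomaly_map(imgs).shape)            # scalar, (2, 1, 224, 224)
```
Training on normal images only drives the decoder to reproduce normal patch features; at test time, regions it fails to reconstruct score high in the anomaly map, which is the general premise of reconstruction-based MUAD methods rather than a faithful reproduction of ViTAD's exact design.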
Related papers
- Rethinking Multi-Modal Object Detection from the Perspective of Mono-Modality Feature Learning [18.268054258939213]
Multi-Modal Object Detection (MMOD) has been widely applied in various applications.
This paper introduces linear probing evaluation to the multi-modal detectors.
We construct a novel framework called M$2$D-LIF, which consists of the Mono-Modality Distillation (M$2$D) method and the Local Illumination-aware Fusion (LIF) module.
arXiv Detail & Related papers (2025-03-14T18:15:53Z) - SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing.
It is designed to accurately detect horizontal or oriented objects from any sensor modality.
This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z) - Dinomaly: The Less Is More Philosophy in Multi-Class Unsupervised Anomaly Detection [31.028622674616134]
We introduce a reconstruction-based anomaly detection framework, namely Dinomaly.
Our proposed Dinomaly achieves impressive image-level AUROCs of 99.6%, 98.7%, and 89.3% on the three datasets, respectively.
arXiv Detail & Related papers (2024-05-23T08:55:20Z) - Learning Feature Inversion for Multi-class Anomaly Detection under General-purpose COCO-AD Benchmark [101.23684938489413]
Anomaly detection (AD) is often focused on detecting anomalies for industrial quality inspection and medical lesion examination.
This work first constructs a large-scale and general-purpose COCO-AD dataset by extending COCO to the AD field.
Inspired by the metrics in the segmentation field, we propose several more practical threshold-dependent AD-specific metrics.
arXiv Detail & Related papers (2024-04-16T17:38:26Z) - General-Purpose Multimodal Transformer meets Remote Sensing Semantic
Segmentation [35.100738362291416]
Multimodal AI seeks to exploit complementary data sources, particularly for complex tasks like semantic segmentation.
Recent trends in general-purpose multimodal networks have shown great potential to achieve state-of-the-art performance.
We propose a UNet-inspired module that employs 3D convolution to encode vital local information and learn cross-modal features simultaneously.
arXiv Detail & Related papers (2023-07-07T04:58:34Z) - Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training of vision transformers (ViT) via masked image modeling (MIM) has been proven very effective.
However, customized algorithms should be carefully designed for hierarchical ViTs, e.g., GreenMIM, instead of using the vanilla and simple MAE for the plain ViT.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z) - SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video
Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z) - BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection [46.37418710853632]
We study a simple, straightforward, yet must-know baseline given the current status of complex design and low detection efficiency in TAD.
We extensively investigate the existing techniques in each component for this baseline, and more importantly, perform end-to-end training over the entire pipeline.
This simple BasicTAD yields an astounding and real-time RGB-Only baseline very close to the state-of-the-art methods with two-stream inputs.
arXiv Detail & Related papers (2022-05-05T15:42:56Z) - Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN using ViTs significantly surpasses other few-shot learning frameworks with ViTs and is the first to achieve higher performance than CNN-based state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z) - Single-Modal Entropy based Active Learning for Visual Question Answering [75.1682163844354]
We address Active Learning in the multi-modal setting of Visual Question Answering (VQA).
In light of the multi-modal inputs, image and question, we propose a novel method for effective sample acquisition.
Our novel idea is simple to implement, cost-efficient, and readily adaptable to other multi-modal tasks.
arXiv Detail & Related papers (2021-10-21T05:38:45Z)