Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly
Detection
- URL: http://arxiv.org/abs/2312.07495v1
- Date: Tue, 12 Dec 2023 18:28:59 GMT
- Title: Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly
Detection
- Authors: Jiangning Zhang, Xuhai Chen, Yabiao Wang, Chengjie Wang, Yong Liu,
Xiangtai Li, Ming-Hsuan Yang, Dacheng Tao
- Abstract summary: This work studies the recently proposed challenging and practical Multi-class Unsupervised Anomaly Detection (MUAD) task.
It only requires normal images for training while simultaneously testing both normal/anomaly images for multiple classes.
A plain Vision Transformer (ViT) with simple architecture has been shown effective in multiple domains.
- Score: 133.93365706990178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work studies the recently proposed challenging and practical Multi-class
Unsupervised Anomaly Detection (MUAD) task, which only requires normal images
for training while simultaneously testing both normal/anomaly images for
multiple classes. Existing reconstruction-based methods typically adopt pyramid
networks as encoders/decoders to obtain multi-resolution features, accompanied
by elaborate sub-modules with heavier handcraft engineering designs for more
precise localization. In contrast, a plain Vision Transformer (ViT) with simple
architecture has been shown effective in multiple domains, which is simpler,
more effective, and elegant. Following this spirit, this paper explores plain
ViT architecture for MUAD. Specifically, we abstract a Meta-AD concept by
induction over current reconstruction-based methods. Then, we instantiate a novel and
elegant plain ViT-based symmetric ViTAD structure, effectively designed step by
step from three macro and four micro perspectives. In addition, this paper
reveals several interesting findings for further exploration. Finally, we
propose a comprehensive and fair evaluation benchmark on eight metrics for the
MUAD task. Based on a naive training recipe, ViTAD achieves state-of-the-art
(SoTA) results and efficiency on the MVTec AD and VisA datasets without bells
and whistles, obtaining 85.4 mAD that surpasses SoTA UniAD by +3.0, and only
requiring 1.1 hours and 2.3G GPU memory to complete model training on a single
V100 GPU. Source code, models, and more results are available at
https://zhangzjn.github.io/projects/ViTAD.
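Reconstruction-based methods of this kind score anomalies by comparing encoder features of patch tokens against their decoder reconstructions: well-reconstructed (normal) regions get low error, poorly reconstructed (anomalous) regions get high error. A minimal, illustrative sketch with toy feature vectors (this is not the actual ViTAD implementation):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def anomaly_map(encoder_feats, decoder_feats):
    """Per-location anomaly score: reconstruction error of each patch token."""
    return [cosine_distance(e, d) for e, d in zip(encoder_feats, decoder_feats)]

def image_score(amap):
    """Image-level anomaly score: max over the spatial anomaly map."""
    return max(amap)

# Toy example: 4 patch tokens with 3-dim features; the last patch is
# badly reconstructed, so it dominates the anomaly map.
enc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
amap = anomaly_map(enc, dec)
print(amap, image_score(amap))
```

In practice the map is interpolated back to image resolution for pixel-level localization, and metrics such as image/pixel AUROC are computed over it.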
Related papers
- Dinomaly: The Less Is More Philosophy in Multi-Class Unsupervised Anomaly Detection [29.370142078092375]
In this paper, we introduce a minimalistic reconstruction-based anomaly detection framework, namely Dinomaly.
Our proposed Dinomaly achieves impressive image AUROC of 99.6%, 98.7%, and 89.3% on three datasets respectively.
arXiv Detail & Related papers (2024-05-23T08:55:20Z)
- MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
However, transferring the pretrained models to downstream tasks may encounter a task discrepancy, because pretraining is formulated as image classification or object discrimination.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z)
- GiT: Towards Generalist Vision Transformer through Universal Language Interface [94.33443158125186]
This paper proposes a simple, yet effective framework, called GiT, simultaneously applicable for various vision tasks only with a vanilla ViT.
GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning.
arXiv Detail & Related papers (2024-03-14T13:47:41Z)
- AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One [47.58919672657824]
We name this approach AM-RADIO (Agglomerative Model -- Reduce All Domains Into One).
We develop a novel architecture (E-RADIO) that exceeds the performance of its predecessors and is at least 7x faster than the teacher models.
Our comprehensive benchmarking process covers downstream tasks including ImageNet classification, ADE20k semantic segmentation, COCO object detection and LLaVa-1.5 framework.
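The agglomerative idea above amounts to distilling several teacher models into one student by summing per-teacher feature-matching losses. A hypothetical toy sketch of that objective (names and features are illustrative, not the AM-RADIO code):

```python
def mse(a, b):
    """Mean squared error between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def agglomerative_loss(student_heads, teacher_feats, weights):
    """Total distillation loss: weighted sum of per-teacher matching
    losses, with one student projection head per teacher."""
    return sum(w * mse(s, t)
               for s, t, w in zip(student_heads, teacher_feats, weights))

# Toy example: one student projected into the feature spaces of three
# teachers (e.g. a CLIP-like, a DINO-like, and a SAM-like model).
student_heads = [[0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]
teachers      = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
loss = agglomerative_loss(student_heads, teachers, weights=[1.0, 1.0, 1.0])
print(loss)
```

Only the first teacher is mismatched here, so only that term contributes to the loss; a perfect student would drive every term to zero.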
arXiv Detail & Related papers (2023-12-10T17:07:29Z)
- Contrastive Learning for Multi-Object Tracking with Transformers [79.61791059432558]
We show how DETR can be turned into a MOT model by employing an instance-level contrastive loss.
Our training scheme learns object appearances while preserving detection capabilities and with little overhead.
Its performance surpasses the previous state-of-the-art by +2.6 mMOTA on the challenging BDD100K dataset.
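An instance-level contrastive loss of the kind described pulls embeddings of the same object across frames together and pushes other identities apart, typically via an InfoNCE-style objective. A toy sketch of that general technique (illustrative only, not the paper's DETR-based implementation):

```python
import math

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor embedding: low when the anchor is
    closest to its positive, high when a negative is closer.
    Embeddings are assumed L2-normalised."""
    def sim(a, b):
        return sum(x * y for x, y in zip(a, b)) / temperature
    pos = math.exp(sim(anchor, positive))
    denom = pos + sum(math.exp(sim(anchor, n)) for n in negatives)
    return -math.log(pos / denom)

# Toy example: the same instance seen in two frames versus two other identities.
anchor    = [1.0, 0.0]
positive  = [1.0, 0.0]                 # same identity, similar appearance
negatives = [[0.0, 1.0], [-1.0, 0.0]]  # other identities
l_match    = info_nce(anchor, positive, negatives)
l_mismatch = info_nce(anchor, negatives[0], [positive, negatives[1]])
print(l_match, l_mismatch)
```

Because the loss only shapes the embedding space, it can be added on top of a detector's existing heads with little overhead, which matches the summary above.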
arXiv Detail & Related papers (2023-11-14T10:07:52Z)
- General-Purpose Multimodal Transformer meets Remote Sensing Semantic Segmentation [35.100738362291416]
Multimodal AI seeks to exploit complementary data sources, particularly for complex tasks like semantic segmentation.
Recent trends in general-purpose multimodal networks have shown great potential to achieve state-of-the-art performance.
We propose a UNet-inspired module that employs 3D convolution to encode vital local information and learn cross-modal features simultaneously.
arXiv Detail & Related papers (2023-07-07T04:58:34Z)
- SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
- A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation [79.265315267391]
We propose a simple and compact ViT architecture called Universal Vision Transformer (UViT)
UViT achieves strong performance on object detection and instance segmentation tasks.
arXiv Detail & Related papers (2021-12-17T20:11:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.