Learning Multi-view Anomaly Detection
- URL: http://arxiv.org/abs/2407.11935v1
- Date: Tue, 16 Jul 2024 17:26:34 GMT
- Title: Learning Multi-view Anomaly Detection
- Authors: Haoyang He, Jiangning Zhang, Guanzhong Tian, Chengjie Wang, Lei Xie,
- Abstract summary: This study explores the recently proposed challenging multi-view Anomaly Detection (AD) task.
We introduce the textbfMulti-textbfView textbfAnomaly textbfDetection (textbfMVAD) framework, which learns and integrates features from multi-views.
- Score: 42.94263165352097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study explores the recently proposed challenging multi-view Anomaly Detection (AD) task. Single-view tasks would encounter blind spots from other perspectives, resulting in inaccuracies in sample-level prediction. Therefore, we introduce the \textbf{M}ulti-\textbf{V}iew \textbf{A}nomaly \textbf{D}etection (\textbf{MVAD}) framework, which learns and integrates features from multi-views. Specifically, we proposed a \textbf{M}ulti-\textbf{V}iew \textbf{A}daptive \textbf{S}election (\textbf{MVAS}) algorithm for feature learning and fusion across multiple views. The feature maps are divided into neighbourhood attention windows to calculate a semantic correlation matrix between single-view windows and all other views, which is a conducted attention mechanism for each single-view window and the top-K most correlated multi-view windows. Adjusting the window sizes and top-K can minimise the computational complexity to linear. Extensive experiments on the Real-IAD dataset for cross-setting (multi/single-class) validate the effectiveness of our approach, achieving state-of-the-art performance among sample \textbf{4.1\%}$\uparrow$/ image \textbf{5.6\%}$\uparrow$/pixel \textbf{6.7\%}$\uparrow$ levels with a total of ten metrics with only \textbf{18M} parameters and fewer GPU memory and training time.
Related papers
- VisMin: Visual Minimal-Change Understanding [7.226130826257802]
We introduce a new, challenging benchmark termed textbfVisual textbfMinimal-Change Understanding (VisMin)
VisMin requires models to predict the correct image-caption match given two images and two captions.
We generate a large-scale training dataset to finetune CLIP and Idefics2, showing significant improvements in fine-grained understanding across benchmarks.
arXiv Detail & Related papers (2024-07-23T18:10:43Z) - Vision Transformer with Sparse Scan Prior [57.37893387775829]
Inspired by the human eye's sparse scanning mechanism, we propose a textbfSparse textbfScan textbfSelf-textbfAttention mechanism.
This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors.
Building on $rmS3rmA$, we introduce the textbfSparse textbfScan textbfVision
arXiv Detail & Related papers (2024-05-22T04:34:36Z) - Vision-Enhanced Semantic Entity Recognition in Document Images via
Visually-Asymmetric Consistency Learning [19.28860833813788]
Existing models commonly train a visual encoder with weak cross-modal supervision signals.
We propose a novel textbfVisually-textbfAsymmetric cotextbfNsistentextbfCy textbfLearning (textscVancl) approach to capture fine-grained visual and layout features.
arXiv Detail & Related papers (2023-10-23T10:37:22Z) - M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot
Fine-grained Action Recognition [80.21796574234287]
M$3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates textitmulti-view encoding, textitmulti-view matching, and textitmulti-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z) - ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View
Semantic Consistency [126.88107868670767]
We propose multi-textbfView textbfConsistent learning (ViewCo) for text-supervised semantic segmentation.
We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image.
We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision.
arXiv Detail & Related papers (2023-01-31T01:57:52Z) - Adaptively-weighted Integral Space for Fast Multiview Clustering [54.177846260063966]
We propose an Adaptively-weighted Integral Space for Fast Multiview Clustering (AIMC) with nearly linear complexity.
Specifically, view generation models are designed to reconstruct the view observations from the latent integral space.
Experiments conducted on several realworld datasets confirm the superiority of the proposed AIMC method.
arXiv Detail & Related papers (2022-08-25T05:47:39Z) - Towards End-to-End Unified Scene Text Detection and Layout Analysis [60.68100769639923]
We introduce the task of unified scene text detection and layout analysis.
The first hierarchical scene text dataset is introduced to enable this novel research task.
We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way.
arXiv Detail & Related papers (2022-03-28T23:35:45Z) - Multi-view Low-rank Preserving Embedding: A Novel Method for Multi-view
Representation [11.91574721055601]
This paper proposes a novel multi-view learning method, named Multi-view Low-rank Preserving Embedding (MvLPE)
It integrates different views into one centroid view by minimizing the disagreement term, based on distance or similarity matrix among instances.
Experiments on six benchmark datasets demonstrate that the proposed method outperforms its counterparts.
arXiv Detail & Related papers (2020-06-14T12:47:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.