Related papers: Learning Multi-view Anomaly Detection

URL: http://arxiv.org/abs/2407.11935v1
Date: Tue, 16 Jul 2024 17:26:34 GMT
Title: Learning Multi-view Anomaly Detection
Authors: Haoyang He, Jiangning Zhang, Guanzhong Tian, Chengjie Wang, Lei Xie,
Abstract summary: This study explores the recently proposed challenging multi-view Anomaly Detection (AD) task. We introduce the Multi-View Anomaly Detection (MVAD) framework, which learns and integrates features from multiple views.
Score: 42.94263165352097
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This study explores the recently proposed and challenging multi-view Anomaly Detection (AD) task. Single-view tasks encounter blind spots from other perspectives, resulting in inaccurate sample-level predictions. Therefore, we introduce the Multi-View Anomaly Detection (MVAD) framework, which learns and integrates features from multiple views. Specifically, we propose a Multi-View Adaptive Selection (MVAS) algorithm for feature learning and fusion across multiple views. The feature maps are divided into neighbourhood attention windows to calculate a semantic correlation matrix between each single-view window and the windows of all other views. Attention is then computed between each single-view window and its top-K most correlated multi-view windows. Adjusting the window size and top-K can reduce the computational complexity to linear. Extensive experiments on the Real-IAD dataset for cross-setting (multi/single-class) validate the effectiveness of our approach, which achieves state-of-the-art performance at the sample (4.1%↑), image (5.6%↑), and pixel (6.7%↑) levels across a total of ten metrics, with only 18M parameters and less GPU memory and training time.
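
The window selection described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (window_partition, topk_cross_view_attention), the mean-pooled window descriptors, the dot-product correlation, and the plain scaled softmax attention without learned projections are all assumptions made for illustration.

```python
import numpy as np

def window_partition(feat, win):
    """Split an (H, W, C) feature map into non-overlapping win x win windows.
    Returns an array of shape (num_windows, win * win, C)."""
    H, W, C = feat.shape
    assert H % win == 0 and W % win == 0, "H and W must be divisible by win"
    x = feat.reshape(H // win, win, W // win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, win * win, C)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def topk_cross_view_attention(views, win, k):
    """views: (V, H, W, C). Each window of each view attends only to the
    tokens of its k most correlated windows taken from the other views."""
    views = np.asarray(views, dtype=float)
    V, H, W, C = views.shape
    windows = np.stack([window_partition(v, win) for v in views])  # (V, N, T, C)
    N, T = windows.shape[1], windows.shape[2]
    desc = windows.mean(axis=2)  # (V, N, C): assumed window-level descriptor
    out = np.empty_like(windows)
    for v in range(V):
        others = [u for u in range(V) if u != v]
        cand_desc = desc[others].reshape(-1, C)        # ((V-1)*N, C)
        cand_tok = windows[others].reshape(-1, T, C)   # ((V-1)*N, T, C)
        corr = desc[v] @ cand_desc.T                   # window-to-window correlation
        top = np.argsort(-corr, axis=1)[:, :k]         # (N, k) indices of best matches
        kv = cand_tok[top].reshape(N, k * T, C)        # gathered keys/values
        q = windows[v]                                 # (N, T, C) queries
        attn = softmax(q @ kv.transpose(0, 2, 1) / np.sqrt(C))  # (N, T, k*T)
        out[v] = attn @ kv                             # fused features per window
    return out

# Hypothetical usage: 3 views of an 8x8 feature map with 4 channels.
rng = np.random.default_rng(0)
fused = topk_cross_view_attention(rng.standard_normal((3, 8, 8, 4)), win=4, k=2)
print(fused.shape)
```

In this sketch the attention cost for each window depends on k and the window size rather than on the total number of tokens in the other views, which is the intuition behind tuning those two parameters.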

Related papers

Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference [56.71209737306054]
We propose ActQKV, a training-free, Activation-aware approach that dynamically determines probe-Query and leverages it to retrieve the relevant KV pairs for inference. Experiments on the Long-Bench and $\infty$ Benchmarks demonstrate its state-of-the-art performance with competitive inference quality and resource efficiency.
arXiv Detail & Related papers (2025-02-19T08:50:44Z)
Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
Balanced Multi-view Clustering [56.17836963920012]
Multi-view clustering (MvC) aims to integrate information from different views to enhance the capability of the model in capturing the underlying data structures. The widely used joint training paradigm in MvC potentially does not fully leverage the multi-view information. We propose a novel balanced multi-view clustering (BMvC) method, which introduces a view-specific contrastive regularization (VCR) to modulate the optimization of each view.
arXiv Detail & Related papers (2025-01-05T14:42:47Z)
I Spy With My Little Eye: A Minimum Cost Multicut Investigation of Dataset Frames [12.177674038614658]
Visual framing analysis is a key method in social sciences for determining common themes and concepts in a discourse. In this work, we phrase the clustering task as a Minimum Cost Multicut Problem (MP). Solutions to the MP have been shown to provide clusterings that maximize the posterior probability, solely from provided local, pairwise probabilities of two images belonging to the same cluster. Our insights into embedding space differences, in combination with the optimal clustering by definition, advance automated visual frame detection.
arXiv Detail & Related papers (2024-12-02T09:09:47Z)
VisMin: Visual Minimal-Change Understanding [7.226130826257802]
We introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin). VisMin requires models to predict the correct image-caption match given two images and two captions. We generate a large-scale training dataset to finetune CLIP and Idefics2, showing significant improvements in fine-grained understanding across benchmarks.
arXiv Detail & Related papers (2024-07-23T18:10:43Z)
Vision Transformer with Sparse Scan Prior [57.37893387775829]
Inspired by the human eye's sparse scanning mechanism, we propose a Sparse Scan Self-Attention mechanism. This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors. Building on $\rm{S}^3\rm{A}$, we introduce the Sparse Scan Vision
arXiv Detail & Related papers (2024-05-22T04:34:36Z)
S^2MVTC: a Simple yet Efficient Scalable Multi-View Tensor Clustering [38.35594663863098]
Experimental results on six large-scale multi-view datasets demonstrate that S^2MVTC significantly outperforms state-of-the-art algorithms in terms of clustering performance and CPU execution time.
arXiv Detail & Related papers (2024-03-14T05:00:29Z)
One for all: A novel Dual-space Co-training baseline for Large-scale Multi-View Clustering [42.92751228313385]
We propose a novel multi-view clustering model, named Dual-space Co-training Large-scale Multi-view Clustering (DSCMC). The main objective of our approach is to enhance the clustering performance by leveraging co-training in two distinct spaces. Our algorithm has approximately linear computational complexity, which guarantees its successful application on large-scale datasets.
arXiv Detail & Related papers (2024-01-28T16:30:13Z)
Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning [19.28860833813788]
Existing models commonly train a visual encoder with weak cross-modal supervision signals. We propose a novel Visually-Asymmetric coNsistenCy Learning (Vancl) approach to capture fine-grained visual and layout features.
arXiv Detail & Related papers (2023-10-23T10:37:22Z)
DealMVC: Dual Contrastive Calibration for Multi-view Clustering [78.54355167448614]
We propose a novel Dual contrastive calibration network for Multi-View Clustering (DealMVC). We first design a fusion mechanism to obtain a global cross-view feature. Then, a global contrastive calibration loss is proposed by aligning the view feature similarity graph and the high-confidence pseudo-label graph. During the training procedure, the interacted cross-view feature is jointly optimized at both local and global levels.
arXiv Detail & Related papers (2023-08-17T14:14:28Z)
M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition. It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making. Explainable visualizations and experimental results demonstrate the superiority of M$^3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z)
Deep Incomplete Multi-view Clustering with Cross-view Partial Sample and Prototype Alignment [50.82982601256481]
We propose a Cross-view Partial Sample and Prototype Alignment Network (CPSPAN) for Deep Incomplete Multi-view Clustering. Unlike existing contrastive-based methods, we adopt pair-observed data alignment as 'proxy supervised signals' to guide instance-to-instance correspondence construction.
arXiv Detail & Related papers (2023-03-28T02:31:57Z)
ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency [126.88107868670767]
We propose multi-View Consistent learning (ViewCo) for text-supervised semantic segmentation. We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image. We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision.
arXiv Detail & Related papers (2023-01-31T01:57:52Z)
Adaptively-weighted Integral Space for Fast Multiview Clustering [54.177846260063966]
We propose an Adaptively-weighted Integral Space for Fast Multiview Clustering (AIMC) with nearly linear complexity. Specifically, view generation models are designed to reconstruct the view observations from the latent integral space. Experiments conducted on several real-world datasets confirm the superiority of the proposed AIMC method.
arXiv Detail & Related papers (2022-08-25T05:47:39Z)
Towards End-to-End Unified Scene Text Detection and Layout Analysis [60.68100769639923]
We introduce the task of unified scene text detection and layout analysis. The first hierarchical scene text dataset is introduced to enable this novel research task. We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way.
arXiv Detail & Related papers (2022-03-28T23:35:45Z)
Tensor-based Intrinsic Subspace Representation Learning for Multi-view Clustering [18.0093330816895]
We propose a novel Tensor-based Intrinsic Subspace Representation Learning (TISRL) method for multi-view clustering in this paper. It can be seen that specific information contained in different views is fully investigated by the rank preserving decomposition. Experimental results on nine commonly used real-world multi-view datasets illustrate the superiority of TISRL.
arXiv Detail & Related papers (2020-10-19T03:36:18Z)
Multi-view Low-rank Preserving Embedding: A Novel Method for Multi-view Representation [11.91574721055601]
This paper proposes a novel multi-view learning method, named Multi-view Low-rank Preserving Embedding (MvLPE). It integrates different views into one centroid view by minimizing the disagreement term, based on distance or similarity matrix among instances. Experiments on six benchmark datasets demonstrate that the proposed method outperforms its counterparts.
arXiv Detail & Related papers (2020-06-14T12:47:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.