Just Dance with $\pi$! A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection
- URL: http://arxiv.org/abs/2505.13123v1
- Date: Mon, 19 May 2025 13:51:57 GMT
- Title: Just Dance with $\pi$! A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection
- Authors: Snehashis Majhi, Giacomo D'Amicantonio, Antitza Dantcheva, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca, Egor Bondarev, Francois Bremond
- Abstract summary: "PI-VAD" is a novel approach that augments RGB representations with five additional modalities. PI-VAD achieves state-of-the-art accuracy on three prominent VAD datasets covering real-world scenarios.
- Score: 12.492419773705898
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Weakly-supervised methods for video anomaly detection (VAD) are conventionally built on RGB spatio-temporal features alone, which limits their reliability in real-world scenarios: RGB features are not distinctive enough to set apart categories such as shoplifting from visually similar events. For robust VAD in complex real-world settings, it is therefore essential to augment RGB spatio-temporal features with additional modalities. Motivated by this, we introduce the Poly-modal Induced framework for VAD: "PI-VAD", a novel approach that augments RGB representations with five additional modalities. Specifically, the modalities provide sensitivity to fine-grained motion (Pose), three-dimensional scene and entity representation (Depth), surrounding objects (Panoptic masks), global motion (Optical flow), and language cues (VLM). Each modality represents an axis of a polygon, streamlined to add salient cues to RGB. PI-VAD includes two plug-in modules, the Pseudo-modality Generation module and the Cross Modal Induction module, which generate modality-specific prototypical representations and thereby induce multi-modal information into the RGB cues. These modules operate by performing anomaly-aware auxiliary tasks and require the five modality backbones only during training. Notably, PI-VAD achieves state-of-the-art accuracy on three prominent VAD datasets encompassing real-world scenarios, without incurring the computational overhead of five modality backbones at inference.
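The abstract implies a training-only induction scheme: auxiliary modality backbones supervise prototype generation from RGB features, and only the induced RGB path runs at inference. Below is a minimal PyTorch sketch of that idea; the module internals, tensor shapes, attention-based fusion, and loss choices are illustrative assumptions on my part, not the authors' implementation.

```python
# Minimal sketch of the training-time induction idea described in the abstract.
# All module names, shapes, and loss choices below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PseudoModalityGeneration(nn.Module):
    """Predicts a prototypical embedding for one auxiliary modality from RGB features."""
    def __init__(self, rgb_dim: int, proto_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(rgb_dim, proto_dim),
            nn.GELU(),
            nn.Linear(proto_dim, proto_dim),
        )

    def forward(self, rgb_feat: torch.Tensor) -> torch.Tensor:
        # rgb_feat: (B, rgb_dim) pooled video feature -> (B, proto_dim) prototype
        return self.proj(rgb_feat)

class CrossModalInduction(nn.Module):
    """Fuses generated pseudo-modality prototypes back into the RGB stream."""
    def __init__(self, rgb_dim: int, proto_dim: int):
        super().__init__()
        self.key_proj = nn.Linear(proto_dim, rgb_dim)
        self.attn = nn.MultiheadAttention(rgb_dim, num_heads=4, batch_first=True)

    def forward(self, rgb_feat: torch.Tensor, protos: torch.Tensor) -> torch.Tensor:
        # rgb_feat: (B, T, rgb_dim) snippet features; protos: (B, 5, proto_dim)
        kv = self.key_proj(protos)
        fused, _ = self.attn(rgb_feat, kv, kv)   # RGB queries attend over prototypes
        return rgb_feat + fused                  # residual induction into RGB cues

def training_step(rgb_feat, modality_targets, generators, induction, scorer, labels):
    """One weakly-supervised step: MIL loss on video labels + auxiliary prototype loss.
    modality_targets come from the five modality backbones and are used only here;
    at inference, scores are computed from the induced RGB features alone."""
    pooled = rgb_feat.mean(dim=1)                                  # (B, rgb_dim)
    protos = torch.stack([g(pooled) for g in generators], dim=1)   # (B, 5, proto_dim)
    aux_loss = sum(F.mse_loss(protos[:, i], modality_targets[i])
                   for i in range(protos.size(1)))
    scores = scorer(induction(rgb_feat, protos)).squeeze(-1)       # (B, T) snippet scores
    mil_loss = F.binary_cross_entropy(scores.amax(dim=1), labels)  # top-score MIL
    return mil_loss + aux_loss
```

In this reading, the generators and induction module are cheap RGB-side heads that stay at test time, while the pose/depth/panoptic/flow/VLM backbones that produced `modality_targets` are dropped, which would match the abstract's claim of no extra backbone cost at inference.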
Related papers
- AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection [58.67129770371016]
We propose AuxDet, a novel framework that reimagines the IRSTD paradigm by incorporating textual metadata for scene-aware optimization. AuxDet consistently outperforms state-of-the-art methods, validating the critical role of auxiliary information in improving robustness and accuracy.
arXiv Detail & Related papers (2025-05-21T07:02:05Z)
- Lightweight RGB-D Salient Object Detection from a Speed-Accuracy Tradeoff Perspective [54.91271106816616]
Current RGB-D methods usually leverage large-scale backbones to improve accuracy but sacrifice efficiency. We propose a Speed-Accuracy Tradeoff Network (SATNet) for lightweight RGB-D SOD from three fundamental perspectives. Concerning depth quality, we introduce the Depth Anything Model to generate high-quality depth maps. For modality fusion, we propose a Decoupled Attention Module (DAM) to explore the consistency within and between modalities. For feature representation, we develop a Dual Information Representation Module (DIRM) with a bi-directional inverted framework.
arXiv Detail & Related papers (2025-05-07T19:37:20Z)
- Real-IAD D3: A Real-World 2D/Pseudo-3D/3D Dataset for Industrial Anomaly Detection [53.2590751089607]
Real-IAD D3 is a high-precision multimodal dataset that incorporates an additional pseudo-3D modality generated through photometric stereo. We introduce an effective approach that integrates RGB, point cloud, and pseudo-3D depth information to leverage the complementary strengths of each modality. Our experiments highlight the importance of these modalities in boosting detection robustness and overall IAD performance.
arXiv Detail & Related papers (2025-04-19T08:05:47Z)
- CM3AE: A Unified RGB Frame and Event-Voxel/-Frame Pre-training Framework [30.734382771657312]
We propose a novel CM3AE pre-training framework for RGB-Event perception. The framework accepts multiple modalities/views of data as input, including RGB images, event images, and event voxels. We construct a large-scale dataset containing 2,535,759 RGB-Event data pairs for pre-training.
arXiv Detail & Related papers (2025-04-17T01:49:46Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective lightweight feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living [9.370655190768163]
We introduce $\pi$-ViT, a novel approach that augments the RGB representations learned by video transformers with 2D and 3D pose information.
$\pi$-ViT achieves state-of-the-art performance on three prominent ADL datasets.
arXiv Detail & Related papers (2023-11-30T18:59:56Z)
- Single Frame Semantic Segmentation Using Multi-Modal Spherical Images [0.0]
We propose a transformer-based cross-modal fusion architecture to bridge the gap between multi-modal fusion and omnidirectional scene perception.
We employ distortion-aware modules to address extreme object deformations and panorama distortions.
Our technique achieves state-of-the-art mIoU performance: 60.60% on Stanford2D3DS (RGB-HHA), 71.97% on Structured3D (RGB-D-N), and 35.92% on Matterport3D (RGB-D).
arXiv Detail & Related papers (2023-08-18T08:06:18Z)
- Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection [67.33924278729903]
In this work, we propose the Dual Swin-Transformer based Mutual Interactive Network (DTMINet).
We adopt the Swin-Transformer as the feature extractor for both the RGB and depth modalities to model long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet.
arXiv Detail & Related papers (2022-06-07T08:35:41Z)
- Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
Key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities.
We show that the transformer is a uniform operation that is highly effective in both feature fusion and feature enhancement.
Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
arXiv Detail & Related papers (2021-12-01T15:53:58Z)
- RGB-D Salient Object Detection with Cross-Modality Modulation and Selection [126.4462739820643]
We present an effective method to progressively integrate and refine cross-modality complementarities for RGB-D salient object detection (SOD).
The proposed network mainly addresses two challenging issues: 1) how to effectively integrate the complementary information from the RGB image and its corresponding depth map, and 2) how to adaptively select more saliency-related features.
arXiv Detail & Related papers (2020-07-14T14:22:50Z)