STNMamba: Mamba-based Spatial-Temporal Normality Learning for Video Anomaly Detection
- URL: http://arxiv.org/abs/2412.20084v1
- Date: Sat, 28 Dec 2024 08:49:23 GMT
- Title: STNMamba: Mamba-based Spatial-Temporal Normality Learning for Video Anomaly Detection
- Authors: Zhangxun Li, Mengyang Zhao, Xuan Yang, Yang Liu, Jiamu Sheng, Xinhua Zeng, Tian Wang, Kewei Wu, Yu-Gang Jiang
- Abstract summary: Video anomaly detection (VAD) has been extensively researched due to its potential for intelligent video systems.
Most existing methods based on CNNs and transformers still suffer from substantial computational burdens.
We propose a lightweight and effective Mamba-based network named STNMamba to enhance the learning of spatial-temporal normality.
- Score: 48.997518615379995
- Abstract: Video anomaly detection (VAD) has been extensively researched due to its potential for intelligent video systems. However, most existing methods based on CNNs and transformers still suffer from substantial computational burdens and have room for improvement in learning spatial-temporal normality. Recently, Mamba has shown great potential for modeling long-range dependencies with linear complexity, providing an effective solution to the above dilemma. To this end, we propose a lightweight and effective Mamba-based network named STNMamba, which incorporates carefully designed Mamba modules to enhance the learning of spatial-temporal normality. Firstly, we develop a dual-encoder architecture, where the spatial encoder equipped with Multi-Scale Vision State Space Blocks (MS-VSSB) extracts multi-scale appearance features, and the temporal encoder employs Channel-Aware Vision State Space Blocks (CA-VSSB) to capture significant motion patterns. Secondly, a Spatial-Temporal Interaction Module (STIM) is introduced to integrate spatial and temporal information across multiple levels, enabling effective modeling of intrinsic spatial-temporal consistency. Within this module, the Spatial-Temporal Fusion Block (STFB) is proposed to fuse the spatial and temporal features into a unified feature space, and the memory bank is utilized to store spatial-temporal prototypes of normal patterns, restricting the model's ability to represent anomalies. Extensive experiments on three benchmark datasets demonstrate that our STNMamba achieves competitive performance with fewer parameters and lower computational costs than existing methods.
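The following is a minimal, illustrative PyTorch sketch of the pipeline described in the abstract: a dual encoder feeding multi-level spatial-temporal fusion and a memory bank of normal prototypes. It is not the authors' implementation. The MS-VSSB and CA-VSSB Mamba blocks are stubbed with plain convolutions, and the frame-difference motion cue, channel sizes, cosine-similarity memory addressing, and frame-prediction decoder are assumptions made only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StubBlock(nn.Module):
    """Placeholder for MS-VSSB / CA-VSSB; the real blocks use Mamba (SSM) scans."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
        )

    def forward(self, x):
        return self.body(x)


class MemoryBank(nn.Module):
    """Learnable prototypes of normal spatial-temporal patterns.

    Soft cosine-similarity addressing is a common choice in memory-based VAD;
    the paper's exact read/update rule is not specified in the abstract.
    """
    def __init__(self, num_items, dim):
        super().__init__()
        self.items = nn.Parameter(torch.randn(num_items, dim))

    def forward(self, feat):                                   # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        q = feat.permute(0, 2, 3, 1).reshape(-1, c)            # one query per location
        attn = torch.softmax(
            F.normalize(q, dim=1) @ F.normalize(self.items, dim=1).t(), dim=1
        )
        read = attn @ self.items                               # re-express queries via normal prototypes
        return read.reshape(b, h, w, c).permute(0, 3, 1, 2)


class STNMambaSketch(nn.Module):
    def __init__(self, channels=(32, 64, 128), mem_items=50):
        super().__init__()
        chs = (3,) + tuple(channels)
        self.spatial_enc = nn.ModuleList(
            [StubBlock(chs[i], chs[i + 1]) for i in range(len(channels))]
        )
        self.temporal_enc = nn.ModuleList(
            [StubBlock(chs[i], chs[i + 1]) for i in range(len(channels))]
        )
        # STFB stand-in: fuse spatial and temporal features at each level (STIM).
        self.fuse = nn.ModuleList([nn.Conv2d(2 * c, c, kernel_size=1) for c in channels])
        self.memory = MemoryBank(mem_items, channels[-1])
        self.decoder = nn.Sequential(                          # predict the target frame
            nn.Upsample(scale_factor=2 ** len(channels), mode="bilinear", align_corners=False),
            nn.Conv2d(channels[-1], 3, kernel_size=3, padding=1),
        )

    def forward(self, frames):                                 # frames: (B, T, 3, H, W)
        appearance = frames[:, -1]                             # last frame as appearance input
        motion = frames[:, -1] - frames[:, -2]                 # frame difference as a simple motion cue
        fused = None
        for s_blk, t_blk, f in zip(self.spatial_enc, self.temporal_enc, self.fuse):
            appearance, motion = s_blk(appearance), t_blk(motion)
            fused = f(torch.cat([appearance, motion], dim=1))  # only the deepest fusion is kept here
        normal_feat = self.memory(fused)                       # pull features toward normal prototypes
        return self.decoder(normal_feat)


if __name__ == "__main__":
    clip = torch.randn(2, 4, 3, 256, 256)                      # short clip of frames
    pred = STNMambaSketch()(clip)
    print(pred.shape)                                          # torch.Size([2, 3, 256, 256])
```

In memory-augmented VAD models of this kind, anomalies are typically scored at test time by the prediction error of the output frame, which grows when the input cannot be well expressed by the stored normal prototypes.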
Related papers
- Mamba-CL: Optimizing Selective State Space Model in Null Space for Continual Learning [54.19222454702032]
Continual Learning aims to equip AI models with the ability to learn a sequence of tasks over time, without forgetting previously learned knowledge.
State Space Models (SSMs) have achieved notable success in computer vision.
We introduce Mamba-CL, a framework that continuously fine-tunes the core SSMs of the large-scale Mamba foundation model.
arXiv Detail & Related papers (2024-11-23T06:36:16Z) - Spatial-Mamba: Effective Visual State Space Models via Structure-Aware State Fusion [46.82975707531064]
Selective state space models (SSMs) excel at capturing long-range dependencies in 1D sequential data.
We propose Spatial-Mamba, a novel approach that establishes neighborhood connectivity directly in the state space.
We show that Spatial-Mamba, even with a single scan, attains or surpasses the state-of-the-art SSM-based models in image classification, detection and segmentation.
arXiv Detail & Related papers (2024-10-19T12:56:58Z) - CollaMamba: Efficient Collaborative Perception with Cross-Agent Spatial-Temporal State Space Model [12.461378793357705]
Multi-agent collaborative perception fosters a deeper understanding of the environment.
Recent studies on collaborative perception mostly utilize CNNs or Transformers to learn feature representation and fusion in the spatial dimension.
We propose a resource-efficient cross-agent spatial-temporal collaborative state space model (SSM), named CollaMamba.
arXiv Detail & Related papers (2024-09-12T02:50:04Z) - MambaVT: Spatio-Temporal Contextual Modeling for robust RGB-T Tracking [51.28485682954006]
We propose a pure Mamba-based framework (MambaVT) to fully exploit spatio-temporal contextual modeling for robust visible-thermal tracking.
Specifically, we devise the long-range cross-frame integration component to globally adapt to target appearance variations.
Experiments show the significant potential of vision Mamba for RGB-T tracking, with MambaVT achieving state-of-the-art performance on four mainstream benchmarks.
arXiv Detail & Related papers (2024-08-15T02:29:00Z) - PoseMamba: Monocular 3D Human Pose Estimation with Bidirectional Global-Local Spatio-Temporal State Space Model [7.286873011001679]
We propose a purely SSM-based approach with linear complexity for 3D human pose estimation in monocular video.
Specifically, we propose a bidirectional global-local spatio-temporal block that comprehensively models human joint relations within individual frames as well as across frames.
This reordering provides a more logical geometric scanning order, resulting in a combined global-local spatial scan.
arXiv Detail & Related papers (2024-08-07T04:38:03Z) - RSCaMa: Remote Sensing Image Change Captioning with State Space Model [29.945966783242337]
Remote Sensing Image Change Captioning (RSICC) aims to describe surface changes between multi-temporal remote sensing images in language.
This poses challenges to spatial and temporal modeling of bi-temporal features.
We propose a novel RSCaMa model, which achieves efficient joint spatial-temporal modeling through multiple CaMa layers.
arXiv Detail & Related papers (2024-04-29T17:31:00Z) - S$^2$Mamba: A Spatial-spectral State Space Model for Hyperspectral Image Classification [44.99672241508994]
Land cover analysis using hyperspectral images (HSI) remains an open problem due to their low spatial resolution and complex spectral information.
We propose S$^2$Mamba, a spatial-spectral state space model for hyperspectral image classification, to excavate spatial-spectral contextual features.
arXiv Detail & Related papers (2024-04-28T15:12:56Z) - MambaAD: Exploring State Space Models for Multi-class Unsupervised Anomaly Detection [53.03687787922032]
Mamba-based models with superior long-range modeling and linear efficiency have garnered substantial attention.
MambaAD consists of a pre-trained encoder and a Mamba decoder featuring Locality-Enhanced State Space (LSS) modules at multiple scales.
The proposed LSS module, integrating parallel cascaded Hybrid State Space (HSS) blocks and multi-kernel convolution operations, effectively captures both long-range and local information.
arXiv Detail & Related papers (2024-04-09T18:28:55Z) - Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning [59.26623999209235]
We present DiST, which disentangles the learning of spatial and temporal aspects of videos.
The disentangled learning in DiST is highly efficient because it avoids the back-propagation of massive pre-trained parameters.
Extensive experiments on five benchmarks show that DiST outperforms existing state-of-the-art methods by convincing margins.
arXiv Detail & Related papers (2023-09-14T17:58:33Z) - Spatiotemporal Inconsistency Learning for DeepFake Video Detection [51.747219106855624]
We present a novel temporal modeling paradigm in the Temporal Inconsistency Module (TIM) by exploiting the temporal difference over adjacent frames along both horizontal and vertical directions.
The Information Supplement Module (ISM) simultaneously utilizes the spatial information from the Spatial Inconsistency Module (SIM) and the temporal information from TIM to establish a more comprehensive spatial-temporal representation.
arXiv Detail & Related papers (2021-09-04T13:05:37Z)