Related papers: Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection

Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection

URL: http://arxiv.org/abs/2506.03162v2
Date: Thu, 25 Sep 2025 21:26:24 GMT
Title: Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection
Authors: Damith Chamalke Senadeera, Xiaoyun Yang, Shibo Li, Muhammad Awais, Dimitrios Kollias, Gregory Slabaugh,
Abstract summary: We present a new benchmark by merging RWF-2000, RLVS, SURV and VioPeru datasets in video violence detection.<n> Experimental results demonstrate that our model achieves state-of-the-art performance on this benchmark.<n> demonstrating the promise of SSMs for scalable, near real-time surveillance violence detection.
Score: 29.267947325070164
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid proliferation of surveillance cameras has increased the demand for automated violence detection. While CNNs and Transformers have shown success in extracting spatio-temporal features, they struggle with long-term dependencies and computational efficiency. We propose Dual Branch VideoMamba with Gated Class Token Fusion (GCTF), an efficient architecture combining a dual-branch design and a state-space model (SSM) backbone where one branch captures spatial features, while the other focuses on temporal dynamics. The model performs continuous fusion via a gating mechanism between the branches to enhance the model's ability to detect violent activities even in challenging surveillance scenarios. We also present a new benchmark by merging RWF-2000, RLVS, SURV and VioPeru datasets in video violence detection, ensuring strict separation between training and testing sets. Experimental results demonstrate that our model achieves state-of-the-art performance on this benchmark and also on DVD dataset which is another novel dataset on video violence detection, offering an optimal balance between accuracy and computational efficiency, demonstrating the promise of SSMs for scalable, near real-time surveillance violence detection.

Related papers

MambaTAD: When State-Space Models Meet Long-Range Temporal Action Detection [94.12444452690329]
This paper presents MambaTAD, a new state-space TAD model that introduces long-range modeling and global feature detection capabilities.<n>MambaTAD achieves superior TAD performance consistently across multiple public benchmarks.
arXiv Detail & Related papers (2025-11-22T06:04:29Z)
M2S2L: Mamba-based Multi-Scale Spatial-temporal Learning for Video Anomaly Detection [18.108479842983822]
Video anomaly detection (VAD) is an essential task in the image processing community with prospects in video surveillance.<n>Traditional VAD approaches struggle to provide robust assessment for modern surveillance systems.<n>We present a Mamba-based multi-scale spatial-temporal learning framework in this paper.
arXiv Detail & Related papers (2025-11-04T04:00:23Z)
Trajectory-aware Shifted State Space Models for Online Video Super-Resolution [57.87099307245989]
This paper presents a novel online VSR method based on Trajectory-aware Shifted SSMs (TS-Mamba)<n>TS-Mamba first constructs the trajectories within a video to select the most similar tokens from the previous frames.<n>Our TS-Mamba achieves state-of-the-art performance in most cases and over 22.7% reduction complexity (in MACs)
arXiv Detail & Related papers (2025-08-14T08:42:15Z)
Tracking the Unstable: Appearance-Guided Motion Modeling for Robust Multi-Object Tracking in UAV-Captured Videos [58.156141601478794]
Multi-object tracking (UAVT) aims to track multiple objects while maintaining consistent identities across frames of a given video.<n>Existing methods typically model motion cues and appearance separately, overlooking their interplay and resulting in suboptimal tracking performance.<n>We propose AMOT, which exploits appearance and motion cues through two key components: an Appearance-Motion Consistency (AMC) matrix and a Motion-aware Track Continuation (MTC) module.
arXiv Detail & Related papers (2025-08-03T12:06:47Z)
Intelligent Image Sensing for Crime Analysis: A ML Approach towards Enhanced Violence Detection and Investigation [1.8219466405383231]
This paper introduces a comprehensive framework for violence detection and classification, employing Supervised Learning for both binary and multi-class violence classification.<n>Training is conducted on a diverse customized datasets with frame-level annotations, incorporating videos from surveillance cameras, human recordings, hockey fight, sohas and wvd dataset across various platforms.
arXiv Detail & Related papers (2025-06-16T18:39:16Z)
RD-UIE: Relation-Driven State Space Modeling for Underwater Image Enhancement [59.364418120895]
Underwater image enhancement (UIE) is a critical preprocessing step for marine vision applications.<n>We develop a novel relation-driven Mamba framework for effective UIE (RD-UIE)<n>Experiments on underwater enhancement benchmarks demonstrate RD-UIE outperforms the state-of-the-art approach WMamba.
arXiv Detail & Related papers (2025-05-02T12:21:44Z)
MVQA: Mamba with Unified Sampling for Efficient Video Quality Assessment [24.053542031123985]
We present MVQA, a Mamba-based model designed for efficient video quality assessment (VQA)<n>USDS combines semantic patch sampling from low-resolution videos and distortion patch sampling from original-resolution videos.<n>Experiments show that the proposed MVQA, equipped with USDS, achieve comparable performance to state-of-the-art methods.
arXiv Detail & Related papers (2025-04-22T16:08:23Z)
MAUCell: An Adaptive Multi-Attention Framework for Video Frame Prediction [0.0]
We introduce the Multi-Attention Unit (MAUCell) which combines Generative Adrative Networks (GANs) and attention mechanisms to improve video prediction.<n>The new design system maintains equilibrium between temporal continuity and spatial accuracy to deliver reliable video prediction.
arXiv Detail & Related papers (2025-01-28T14:52:10Z)
Understanding Long Videos via LLM-Powered Entity Relation Graphs [51.13422967711056]
GraphVideoAgent is a framework that maps and monitors the evolving relationships between visual entities throughout the video sequence.<n>Our approach demonstrates remarkable effectiveness when tested against industry benchmarks.
arXiv Detail & Related papers (2025-01-27T10:57:24Z)
Towards Stabilized and Efficient Diffusion Transformers through Long-Skip-Connections with Spectral Constraints [51.83081671798784]
Diffusion Transformers (DiT) have emerged as a powerful architecture for image and video generation, offering superior quality and scalability.<n>DiT's practical application suffers from inherent dynamic feature instability, leading to error amplification during cached inference.<n>We propose Skip-DiT, a novel DiT variant enhanced with Long-Skip-Connections (LSCs) - the key efficiency component in U-Nets.
arXiv Detail & Related papers (2024-11-26T17:28:10Z)
DemMamba: Alignment-free Raw Video Demoireing with Frequency-assisted Spatio-Temporal Mamba [18.06907326360215]
Moire patterns, resulting from the interference of two similar repetitive patterns, are frequently observed during the capture of images or videos on screens. This paper introduces a novel alignment-free raw video demoireing network with frequency-assisted-temporal Mamba. Our proposed DemMamba surpasses state-of-the-art methods by 1.3 dB in PSNR, and also provides a satisfactory visual experience.
arXiv Detail & Related papers (2024-08-20T09:31:03Z)
Low-Light Video Enhancement via Spatial-Temporal Consistent Decomposition [52.89441679581216]
Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes plagued by severe invisibility and noise.<n>We present an innovative video decomposition strategy that incorporates view-independent and view-dependent components.<n>Our framework consistently outperforms existing methods, establishing a new SOTA performance.
arXiv Detail & Related papers (2024-05-24T15:56:40Z)
CUE-Net: Violence Detection Video Analytics with Spatial Cropping, Enhanced UniformerV2 and Modified Efficient Additive Attention [21.354382437543315]
We introduce CUE-Net, a novel architecture designed for automated violence detection in video surveillance. By focusing on both local and globaltemporal features, CUE-Net achieves state-of-the-art performance on the RWF-2000 and RLVS datasets.
arXiv Detail & Related papers (2024-04-27T20:09:40Z)
Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection [19.643936110623653]
Video Anomaly Detection (VAD) aims to identify abnormalities within a specific context and timeframe. Recent deep learning-based VAD models have shown promising results by generating high-resolution frames. We propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task.
arXiv Detail & Related papers (2024-03-28T03:07:16Z)
Inter-frame Accelerate Attack against Video Interpolation Models [73.28751441626754]
We apply adversarial attacks to VIF models and find that the VIF models are very vulnerable to adversarial examples. We propose a novel attack method named Inter-frame Accelerate Attack (IAA) thats the iterations as the perturbation for the previous adjacent frame. It is shown that our method can improve attack efficiency greatly while achieving comparable attack performance with traditional methods.
arXiv Detail & Related papers (2023-05-11T03:08:48Z)
Detecting Violence in Video Based on Deep Features Fusion Technique [0.30458514384586394]
This work proposed a novel method to detect violence using a fusion tech-nique of two convolutional neural networks (CNNs) The performance of the proposed method is evaluated using three standard benchmark datasets in terms of detection accuracy.
arXiv Detail & Related papers (2022-04-15T12:51:20Z)
Flow-Guided Sparse Transformer for Video Deblurring [124.11022871999423]
FlowGuided Sparse Transformer (F GST) is a framework for video deblurring. FGSW-MSA enjoys the guidance of the estimated optical flow to globally sample spatially sparse elements corresponding to the same scene patch in neighboring frames. Our proposed F GST outperforms state-of-the-art patches on both DVD and GOPRO datasets and even yields more visually pleasing results in real video deblurring.
arXiv Detail & Related papers (2022-01-06T02:05:32Z)
MotionHint: Self-Supervised Monocular Visual Odometry with Motion Constraints [70.76761166614511]
We present a novel self-supervised algorithm named MotionHint for monocular visual odometry (VO) Our MotionHint algorithm can be easily applied to existing open-sourced state-of-the-art SSM-VO systems.
arXiv Detail & Related papers (2021-09-14T15:35:08Z)
Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM [0.0]
We propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet. SepConvLSTM is constructed by replacing convolution operation at each gate of ConvLSTM with a depthwise separable convolution. Our model outperforms the accuracy on the larger and more challenging RWF-2000 dataset by more than a 2% margin.
arXiv Detail & Related papers (2021-02-21T12:01:48Z)
ASFD: Automatic and Scalable Face Detector [129.82350993748258]
We propose a novel Automatic and Scalable Face Detector (ASFD) ASFD is based on a combination of neural architecture search techniques as well as a new loss design. Our ASFD-D6 outperforms the prior strong competitors, and our lightweight ASFD-D0 runs at more than 120 FPS with Mobilenet for VGA-resolution images.
arXiv Detail & Related papers (2020-03-25T06:00:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.