MambaTAD: When State-Space Models Meet Long-Range Temporal Action Detection
- URL: http://arxiv.org/abs/2511.17929v1
- Date: Sat, 22 Nov 2025 06:04:29 GMT
- Title: MambaTAD: When State-Space Models Meet Long-Range Temporal Action Detection
- Authors: Hui Lu, Yi Yu, Shijian Lu, Deepu Rajan, Boon Poh Ng, Alex C. Kot, Xudong Jiang
- Abstract summary: This paper presents MambaTAD, a new state-space TAD model that introduces long-range modeling and global feature detection capabilities. MambaTAD achieves superior TAD performance consistently across multiple public benchmarks.
- Score: 94.12444452690329
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal Action Detection (TAD) aims to identify and localize actions by determining their starting and ending frames within untrimmed videos. Recent structured state-space models such as Mamba have demonstrated potential in TAD due to their long-range modeling capability and linear computational complexity. However, structured state-space models often face two key challenges in TAD, namely, decay of temporal context due to recursive processing and self-element conflict during global visual context modeling, both of which become more severe when handling long-span action instances. Additionally, traditional TAD methods struggle to detect long-span action instances due to a lack of global awareness and inefficient detection heads. This paper presents MambaTAD, a new state-space TAD model that introduces long-range modeling and global feature detection capabilities for accurate temporal action detection. MambaTAD comprises two novel designs that complement each other for superior TAD performance. First, it introduces a Diagonal-Masked Bidirectional State-Space (DMBSS) module that effectively facilitates global feature fusion and temporal action detection. Second, it introduces a global feature fusion head that refines the detection progressively with multi-granularity features and global awareness. In addition, MambaTAD tackles TAD in an end-to-end, one-stage manner using a new state-space temporal adapter (SSTA), which reduces network parameters and computation cost with linear complexity. Extensive experiments show that MambaTAD achieves superior TAD performance consistently across multiple public benchmarks.
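The abstract's central idea, a bidirectional state-space scan whose diagonal (self-element) contribution is removed before fusion, can be illustrated with a toy linear recurrence. The sketch below is only an assumption-laden illustration of the general concept, not the paper's actual DMBSS module: the function names, the scalar `decay` parameter, and the subtract-self fusion are all hypothetical simplifications.

```python
import numpy as np

def ssm_scan(x, decay=0.9):
    """Simple linear state-space recurrence: h[t] = decay * h[t-1] + x[t].

    Runs in O(T) over a (T, D) input, mirroring the linear complexity
    that makes state-space models attractive for long videos.
    """
    h = np.zeros_like(x)
    acc = np.zeros(x.shape[-1])
    for t in range(x.shape[0]):
        acc = decay * acc + x[t]
        h[t] = acc
    return h

def diagonal_masked_bidirectional(x, decay=0.9):
    """Toy sketch of a diagonal-masked bidirectional state-space pass.

    A forward scan and a backward scan each aggregate one-directional
    context (including the token itself); subtracting x[t] from both
    removes each position's own (diagonal) contribution before fusion,
    mimicking the self-element masking idea at a very high level.
    """
    fwd = ssm_scan(x, decay)              # context from the past, incl. self
    bwd = ssm_scan(x[::-1], decay)[::-1]  # context from the future, incl. self
    return (fwd - x) + (bwd - x)          # fuse with the self term removed
```

With `decay=0.0` every position sees no context at all, so the masked output collapses to zero, which makes the role of the removed diagonal term easy to verify by hand.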
Related papers
- DriveMamba: Task-Centric Scalable State Space Model for Efficient End-to-End Autonomous Driving [47.573692944838115]
DriveMamba is a task-centric, scalable paradigm for efficient E2E-AD. It integrates sequential task relation modeling, implicit correspondence learning, and long-term temporal fusion into a single-stage unified Mamba decoder. Extensive experiments conducted on the nuScenes and Bench2Drive datasets demonstrate the superiority, generalizability, and great efficiency of DriveMamba.
arXiv Detail & Related papers (2026-02-09T11:48:29Z)
- AR-MOT: Autoregressive Multi-object Tracking [56.09738000988466]
We propose a novel autoregressive paradigm that formulates MOT as a sequence generation task within a large language model (LLM) framework. This design enables the model to output structured results through flexible sequence construction, without requiring any task-specific heads. To enhance region-level visual perception, we introduce an Object Tokenizer based on a pretrained detector.
arXiv Detail & Related papers (2026-01-05T09:17:28Z)
- TSkel-Mamba: Temporal Dynamic Modeling via State Space Model for Human Skeleton-based Action Recognition [59.99922360648663]
TSkel-Mamba is a hybrid Transformer-Mamba framework that effectively captures both spatial and temporal dynamics. The MTI module employs multi-scale Cycle operators to capture cross-channel temporal interactions, a critical factor in action recognition.
arXiv Detail & Related papers (2025-12-12T11:55:16Z)
- High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation [34.450956424316196]
Modeling high-resolution spatiotemporal representations is essential for video-based human pose estimation (VHPE). Current state-of-the-art methods typically have difficulty balancing global and local dynamic modeling. We present a framework that extends Mamba to separately learn global and local high-resolution representations for VHPE.
arXiv Detail & Related papers (2025-10-13T05:18:27Z)
- StateSpaceDiffuser: Bringing Long Context to Diffusion World Models [52.92249035412797]
We introduce StateSpaceDiffuser, where a diffusion model is enabled to perform long-context tasks by integrating features from a state-space model. This design restores long-term memory while preserving the high-fidelity synthesis of diffusion models.
arXiv Detail & Related papers (2025-05-28T11:27:54Z)
- MDDFNet: Mamba-based Dynamic Dual Fusion Network for Traffic Sign Detection [0.081585306387285]
We propose a novel object detection network, the Mamba-based Dynamic Dual Fusion Network (MDDFNet), for traffic sign detection. The network integrates a dynamic dual fusion module and a Mamba-based backbone to simultaneously tackle the aforementioned issues. Extensive experiments conducted on the TT100K (Tsinghua-Tencent 100K) dataset demonstrate that MDDFNet outperforms other state-of-the-art detectors.
arXiv Detail & Related papers (2025-05-02T14:53:25Z)
- A Deep Learning Framework for Sequence Mining with Bidirectional LSTM and Multi-Scale Attention [11.999319439383918]
This paper addresses the challenges of mining latent patterns and modeling contextual dependencies in complex sequence data. A sequence pattern mining algorithm is proposed by integrating Bidirectional Long Short-Term Memory (BiLSTM) with a multi-scale attention mechanism. BiLSTM captures both forward and backward dependencies in sequences, enhancing the model's ability to perceive global contextual structures.
arXiv Detail & Related papers (2025-04-21T16:53:02Z)
- MS-Temba: Multi-Scale Temporal Mamba for Efficient Temporal Action Detection [11.534493974662304]
Temporal Action Detection (TAD) in untrimmed videos requires models that can efficiently process long-duration videos. We propose Multi-Scale Temporal Mamba (MS-Temba), the first Mamba-based architecture specifically designed for densely labeled TAD tasks. MS-Temba achieves state-of-the-art performance on long-duration videos, remains competitive on shorter segments, and reduces model complexity by 88%.
arXiv Detail & Related papers (2025-01-10T17:52:47Z)
- STNMamba: Mamba-based Spatial-Temporal Normality Learning for Video Anomaly Detection [48.997518615379995]
Video anomaly detection (VAD) has been extensively researched due to its potential for intelligent video systems. Most existing methods based on CNNs and transformers still suffer from substantial computational burdens. We propose a lightweight and effective Mamba-based network named STNMamba to enhance the learning of spatial-temporal normality.
arXiv Detail & Related papers (2024-12-28T08:49:23Z)
- SIGMA: Selective Gated Mamba for Sequential Recommendation [56.85338055215429]
Mamba, a recent advancement, has exhibited exceptional performance in time series prediction. We introduce a new framework named Selective Gated Mamba (SIGMA) for sequential recommendation. Our results indicate that SIGMA outperforms current models on five real-world datasets.
arXiv Detail & Related papers (2024-08-21T09:12:59Z)
- MambaVT: Spatio-Temporal Contextual Modeling for Robust RGB-T Tracking [51.28485682954006]
We propose a pure Mamba-based framework (MambaVT) to fully exploit spatio-temporal contextual modeling for robust visible-thermal tracking.
Specifically, we devise the long-range cross-frame integration component to globally adapt to target appearance variations.
Experiments show the significant potential of vision Mamba for RGB-T tracking, with MambaVT achieving state-of-the-art performance on four mainstream benchmarks.
arXiv Detail & Related papers (2024-08-15T02:29:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.