Mamba-OTR: a Mamba-based Solution for Online Take and Release Detection from Untrimmed Egocentric Video
- URL: http://arxiv.org/abs/2507.16342v1
- Date: Tue, 22 Jul 2025 08:23:51 GMT
- Title: Mamba-OTR: a Mamba-based Solution for Online Take and Release Detection from Untrimmed Egocentric Video
- Authors: Alessandro Sebastiano Catinello, Giovanni Maria Farinella, Antonino Furnari
- Abstract summary: Mamba-OTR is designed to exploit temporal recurrence during inference while being trained on short video clips. Mamba-OTR achieves a noteworthy mp-mAP of 45.48 when operating in a sliding-window fashion. We will publicly release the source code of Mamba-OTR to support future research.
- Score: 57.805927523341516
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This work tackles the problem of Online detection of Take and Release (OTR) of an object in untrimmed egocentric videos. This task is challenging due to severe label imbalance, with temporally sparse positive annotations, and the need for precise temporal predictions. Furthermore, methods need to be computationally efficient in order to be deployed in real-world online settings. To address these challenges, we propose Mamba-OTR, a model based on the Mamba architecture. Mamba-OTR is designed to exploit temporal recurrence during inference while being trained on short video clips. To address label imbalance, our training pipeline incorporates the focal loss and a novel regularization scheme that aligns model predictions with the evaluation metric. Extensive experiments on EPIC-KITCHENS-100, comparisons with transformer-based approaches, and the evaluation of different training and test schemes demonstrate the superiority of Mamba-OTR in both accuracy and efficiency. These findings are particularly evident when evaluating full-length videos or high frame-rate sequences, even when the model is trained on short video snippets for computational convenience. The proposed Mamba-OTR achieves a noteworthy mp-mAP of 45.48 when operating in a sliding-window fashion, and 43.35 in streaming mode, versus the 20.32 of a vanilla transformer and 25.16 of a vanilla Mamba, thus providing a strong baseline for OTR. We will publicly release the source code of Mamba-OTR to support future research.
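The abstract names the focal loss as one ingredient for handling label imbalance. As a rough illustration only, the sketch below implements the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), for a single per-frame prediction. The function name `binary_focal_loss` and the default values of `gamma` and `alpha` are assumptions made for this example; the paper's exact variant, hyperparameters, and regularization scheme are not described in this listing.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Standard binary focal loss for one prediction (illustrative sketch).

    p: predicted probability of the positive class (e.g. "take/release here").
    y: ground-truth label, 1 for positive, 0 for negative.
    gamma, alpha: hypothetical defaults, not taken from the Mamba-OTR paper.
    """
    # Clamp the probability so log() never receives 0.
    p = min(max(p, eps), 1.0 - eps)
    # p_t is the probability the model assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    # alpha_t weights positives by alpha and negatives by (1 - alpha).
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # An easy negative (confidently rejected background frame) contributes
    # far less loss than a missed positive frame.
    print(binary_focal_loss(0.01, 0))
    print(binary_focal_loss(0.01, 1))
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the usual binary cross-entropy, which makes the role of the modulating factor easy to check: raising `gamma` down-weights the many easy background frames so that the sparse positive annotations dominate the gradient.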
Related papers
- MambaPEFT: Exploring Parameter-Efficient Fine-Tuning for Mamba [0.5530212768657544]
Mamba, a State Space Model (SSM)-based model, has attracted attention as a potential alternative to Transformers. We investigate the effectiveness of existing PEFT methods for Transformers when applied to Mamba. We propose new Mamba-specific PEFT methods that leverage the distinctive structure of Mamba.
arXiv Detail & Related papers (2024-11-06T11:57:55Z)
- The Mamba in the Llama: Distilling and Accelerating Hybrid Models [76.64055251296548]
We show how to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model achieves performance comparable to the original Transformer in chat benchmarks. We also introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models.
arXiv Detail & Related papers (2024-08-27T17:56:11Z)
- Snakes and Ladders: Two Steps Up for VideoMamba [10.954210339694841]
In this paper, we theoretically analyze the differences between self-attention and Mamba.
We propose VideoMambaPro models that surpass VideoMamba by 1.6-2.8% and 1.1-1.9% top-1.
Our two solutions are orthogonal to recent advances in Vision Mamba models, and are likely to provide further improvements in future models.
arXiv Detail & Related papers (2024-06-27T08:45:31Z)
- DeciMamba: Exploring the Length Extrapolation Potential of Mamba [89.07242846058023]
We introduce DeciMamba, a context-extension method specifically designed for Mamba. Experiments over real-world long-range NLP tasks show that DeciMamba can extrapolate to context lengths significantly longer than the ones seen during training.
arXiv Detail & Related papers (2024-06-20T17:40:18Z)
- Mamba-R: Vision Mamba ALSO Needs Registers [45.41648622999754]
Similar to Vision Transformers, this paper identifies artifacts also present within the feature maps of Vision Mamba. These artifacts, corresponding to high-norm tokens emerging in low-information background areas of images, appear much more severe in Vision Mamba. To mitigate this issue, we follow the prior solution of introducing register tokens into Vision Mamba.
arXiv Detail & Related papers (2024-05-23T17:58:43Z)
- MambaOut: Do We Really Need Mamba for Vision? [70.60495392198686]
Mamba, an architecture with RNN-like token mixer of state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism.
This paper conceptually concludes that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics.
We construct a series of models named MambaOut through stacking Mamba blocks while removing their core token mixer, SSM.
arXiv Detail & Related papers (2024-05-13T17:59:56Z)
- CLIP-Mamba: CLIP Pretrained Mamba Models with OOD and Hessian Evaluation [18.383760896304604]
This report introduces the first attempt to train a Mamba model utilizing contrastive language-image pretraining (CLIP).
A Mamba model with 67 million parameters is on par with a 307-million-parameter Vision Transformer (ViT) model in zero-shot classification tasks.
arXiv Detail & Related papers (2024-04-30T09:40:07Z)
- ReMamber: Referring Image Segmentation with Mamba Twister [51.291487576255435]
ReMamber is a novel RIS architecture that integrates the power of Mamba with a multi-modal Mamba Twister block.
The Mamba Twister explicitly models image-text interaction, and fuses textual and visual features through its unique channel and spatial twisting mechanism.
arXiv Detail & Related papers (2024-03-26T16:27:37Z)
- Is Mamba Effective for Time Series Forecasting? [30.85990093479062]
We propose a Mamba-based model named Simple-Mamba (S-Mamba) for time series forecasting.
Specifically, we tokenize the time points of each variate autonomously via a linear layer.
Experiments on thirteen public datasets prove that S-Mamba maintains low computational overhead and achieves leading performance.
arXiv Detail & Related papers (2024-03-17T08:50:44Z)
- MiM-ISTD: Mamba-in-Mamba for Efficient Infrared Small Target Detection [72.46396769642787]
We develop a nested structure, Mamba-in-Mamba (MiM-ISTD), for efficient infrared small target detection.
MiM-ISTD is $8 \times$ faster than the SOTA method and reduces GPU memory usage by 62.2% when testing on $2048 \times 2048$ images.
arXiv Detail & Related papers (2024-03-04T15:57:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.