Related papers: Mamba YOLO: A Simple Baseline for Object Detection with State Space Model

URL: http://arxiv.org/abs/2406.05835v2
Date: Sat, 14 Dec 2024 03:50:47 GMT
Title: Mamba YOLO: A Simple Baseline for Object Detection with State Space Model
Authors: Zeyu Wang, Chen Li, Huiying Xu, Xinzhong Zhu, Hongbo Li
Abstract summary: YOLO series has set a new benchmark for real-time object detectors. Transformer-based structures have emerged as the most powerful solution. However, the quadratic complexity of the self-attentive mechanism increases the computational burden. We introduce a simple yet effective baseline approach called Mamba YOLO.
Score: 10.44725284994877
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Driven by the rapid development of deep learning technology, the YOLO series has set a new benchmark for real-time object detectors. Additionally, transformer-based structures have emerged as the most powerful solution in the field, greatly extending the model's receptive field and achieving significant performance improvements. However, this improvement comes at a cost as the quadratic complexity of the self-attentive mechanism increases the computational burden of the model. To address this problem, we introduce a simple yet effective baseline approach called Mamba YOLO. Our contributions are as follows: 1) We propose that the ODMamba backbone introduce a State Space Model (SSM) with linear complexity to address the quadratic complexity of self-attention. Unlike the other Transformer-based and SSM-based methods, ODMamba is simple to train without pretraining. 2) For real-time requirements, we designed the macro structure of ODMamba, determined the optimal stage ratio and scaling size. 3) We design the RG Block that employs a multi-branch structure to model the channel dimensions, which addresses the possible limitations of SSM in sequence modeling, such as insufficient receptive fields and weak image localization. This design captures localized image dependencies more accurately and significantly. Extensive experiments on the publicly available COCO benchmark dataset show that Mamba YOLO achieves state-of-the-art performance compared to previous methods. Specifically, a tiny version of Mamba YOLO achieves a 7.5% improvement in mAP on a single 4090 GPU with an inference time of 1.5 ms. The PyTorch code is available at: https://github.com/HZAI-ZJNU/Mamba-YOLO
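
To make the abstract's linear-complexity claim concrete, here is a minimal sketch of a plain, time-invariant diagonal state space model evaluated as a recurrence. It is not the authors' ODMamba backbone and not Mamba's selective scan (where the parameters depend on the input); the function name `ssm_scan`, its arguments, and the zero-order-hold discretization are assumptions chosen for illustration. The point is only that each token updates a fixed-size state once, so the cost grows linearly with sequence length, whereas self-attention compares every token with every other token.

```python
import math


def ssm_scan(u, a, b, c, dt):
    """Evaluate a diagonal linear SSM over the scalar sequence u.

    Continuous form (per state channel i, illustrative only):
        h_i'(t) = a[i] * h_i(t) + b[i] * u(t),   y(t) = sum_i c[i] * h_i(t)
    Zero-order-hold discretization with step dt (requires a[i] != 0):
        a_bar = exp(dt * a),   b_bar = (a_bar - 1) / a * b
    Cost is O(len(u) * len(a)): one fixed-size state update per token.
    """
    a_bar = [math.exp(dt * ai) for ai in a]
    b_bar = [(ab - 1.0) / ai * bi for ab, ai, bi in zip(a_bar, a, b)]
    h = [0.0] * len(a)  # hidden state, fixed size regardless of sequence length
    ys = []
    for u_t in u:
        # Single pass over the sequence: update the state, then read out.
        h = [ab * hi + bb * u_t for ab, hi, bb in zip(a_bar, h, b_bar)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


if __name__ == "__main__":
    # Hypothetical toy input: a constant signal fed to a one-channel SSM.
    outputs = ssm_scan([1.0] * 8, a=[-1.0], b=[1.0], c=[1.0], dt=0.1)
    print(outputs)
    # Rough operation counts for a sequence of n tokens.
    n = 1024
    print("scan state updates:", n, "attention token pairs:", n * n)
```

For a 1024-token sequence the scan performs 1024 state updates, while pairwise attention forms 1024 × 1024 token pairs; this gap is what the abstract refers to as linear versus quadratic complexity.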

Related papers

MobileMamba: Lightweight Multi-Receptive Visual Mamba Network [51.33486891724516]
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs. We propose the MobileMamba framework, which balances efficiency and performance. MobileMamba achieves up to 83.6% on Top-1, surpassing existing state-of-the-art methods.
arXiv Detail & Related papers (2024-11-24T18:01:05Z)
Mamba-CL: Optimizing Selective State Space Model in Null Space for Continual Learning [54.19222454702032]
Continual Learning aims to equip AI models with the ability to learn a sequence of tasks over time, without forgetting previously learned knowledge. State Space Models (SSMs) have achieved notable success in computer vision. We introduce Mamba-CL, a framework that continuously fine-tunes the core SSMs of the large-scale Mamba foundation model.
arXiv Detail & Related papers (2024-11-23T06:36:16Z)
Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection [18.65107742085838]
We present Mamba-YOLO-World, a novel YOLO-based OVD model employing the proposed MambaFusion Path Aggregation Network (MambaFusion-PAN) as its neck architecture. Specifically, we introduce an innovative State Space Model-based feature fusion mechanism consisting of a Parallel-Guided Selective Scan algorithm and a Serial-Guided Selective Scan algorithm with linear complexity and globally guided receptive fields. Experiments demonstrate that our model outperforms the original YOLO-World on the COCO and LVIS benchmarks in both zero-shot and fine-tuning settings.
arXiv Detail & Related papers (2024-09-13T03:23:52Z)
SIGMA: Selective Gated Mamba for Sequential Recommendation [56.85338055215429]
Mamba, a recent advancement, has exhibited exceptional performance in time series prediction. We introduce a new framework named Selective Gated Mamba (SIGMA) for Sequential Recommendation. Our results indicate that SIGMA outperforms current models on five real-world datasets.
arXiv Detail & Related papers (2024-08-21T09:12:59Z)
MambaVT: Spatio-Temporal Contextual Modeling for robust RGB-T Tracking [51.28485682954006]
We propose a pure Mamba-based framework (MambaVT) to fully exploit intrinsic spatio-temporal contextual modeling for robust visible-thermal tracking. Specifically, we devise the long-range cross-frame integration component to globally adapt to target appearance variations. Experiments show the significant potential of vision Mamba for RGB-T tracking, with MambaVT achieving state-of-the-art performance on four mainstream benchmarks.
arXiv Detail & Related papers (2024-08-15T02:29:00Z)
Hyper-YOLO: When Visual Object Detection Meets Hypergraph Computation [74.65906322148997]
We introduce a new object detection method that integrates hypergraph computations to capture the complex high-order correlations among visual features. Hyper-YOLO significantly outperforms the advanced YOLOv8-N and YOLOv9-T with 12% $\text{AP}^{val}$ and 9% $\text{AP}^{val}$ improvements.
arXiv Detail & Related papers (2024-08-09T01:21:15Z)
LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba [54.85262314960038]
Local Attentional Mamba blocks capture both global contexts and local details with linear complexity. Our model exhibits exceptional scalability and surpasses the performance of DiT across various model scales on ImageNet at 256x256 resolution. Compared to state-of-the-art diffusion models on ImageNet 256x256 and 512x512, our largest model presents notable advantages, such as a reduction of up to 62% GFLOPs.
arXiv Detail & Related papers (2024-08-05T16:39:39Z)
Spatial Transformer Network YOLO Model for Agricultural Object Detection [0.3124884279860061]
We propose a new method that integrates spatial transformer networks (STNs) into YOLO to improve performance. The proposed STN-YOLO aims to enhance the model's effectiveness by focusing on important areas of the image. We apply the STN-YOLO on benchmark datasets for Agricultural object detection as well as a new dataset from a state-of-the-art plant phenotyping greenhouse facility.
arXiv Detail & Related papers (2024-07-31T14:53:41Z)
YOLOv10: Real-Time End-to-End Object Detection [68.28699631793967]
YOLOs have emerged as the predominant paradigm in the field of real-time object detection. The reliance on the non-maximum suppression (NMS) for post-processing hampers the end-to-end deployment of YOLOs. We introduce the holistic efficiency-accuracy driven model design strategy for YOLOs.
arXiv Detail & Related papers (2024-05-23T11:44:29Z)
FER-YOLO-Mamba: Facial Expression Detection and Classification Based on Selective State Space [9.68374853606234]
This paper presents the FER-YOLO-Mamba model, which integrates the principles of Mamba and YOLO technologies. Within the FER-YOLO-Mamba model, we further devise a FER-YOLO-VSS dual-branch module, which combines the inherent strengths of convolutional layers in local feature extraction. To the best of our knowledge, this is the first Vision Mamba model designed for facial expression detection and classification.
arXiv Detail & Related papers (2024-05-03T03:20:37Z)
Mamba3D: Enhancing Local Features for 3D Point Cloud Analysis via State Space Model [18.30032389736101]
The Mamba model, based on state space models (SSM), outperforms Transformer in multiple areas with only linear complexity. We present Mamba3D, a state space model tailored for point cloud learning to enhance local feature extraction.
arXiv Detail & Related papers (2024-04-23T12:20:27Z)
MambaUIE&SR: Unraveling the Ocean's Secrets with Only 2.8 GFLOPs [1.7648680700685022]
Underwater Image Enhancement (UIE) techniques aim to address the problem of underwater image degradation due to light absorption and scattering. In recent years, both Convolutional Neural Network (CNN)-based and Transformer-based methods have been widely explored. MambaUIE is able to efficiently synthesize global and local information and maintains a very small number of parameters with high accuracy.
arXiv Detail & Related papers (2024-04-22T05:12:11Z)
VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting [11.058879849373572]
Combining ViTs or CNNs with RNNs for spatiotemporal forecasting has yielded unparalleled results in predicting temporal and spatial dynamics. Recent Mamba-based architectures have been met with enthusiasm for their exceptional long-sequence modeling capabilities. We propose the VMRNN cell, a recurrent unit that integrates the strengths of Vision Mamba blocks with LSTM.
arXiv Detail & Related papers (2024-03-25T08:26:42Z)
ZigMa: A DiT-style Zigzag Mamba Diffusion Model [22.68317748373856]
We aim to leverage the long sequence modeling capability of a State-Space Model called Mamba to extend its applicability to visual data generation. We introduce a simple, plug-and-play, zero-parameter method named Zigzag Mamba, which outperforms Mamba-based baselines. We integrate Zigzag Mamba with the Interpolant framework to investigate the scalability of the model on large-resolution visual datasets.
arXiv Detail & Related papers (2024-03-20T17:59:14Z)
PointMamba: A Simple State Space Model for Point Cloud Analysis [65.59944745840866]
We propose PointMamba, transferring the success of Mamba, a recent representative state space model (SSM), from NLP to point cloud analysis tasks. Unlike traditional Transformers, PointMamba employs a linear complexity algorithm, presenting global modeling capacity while significantly reducing computational costs.
arXiv Detail & Related papers (2024-02-16T14:56:13Z)
Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module. We identify that a key weakness of such models is their inability to perform content-based reasoning. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z)
Understanding Self-attention Mechanism via Dynamical System Perspective [58.024376086269015]
The self-attention mechanism (SAM) is widely used in various fields of artificial intelligence. We show that the intrinsic stiffness phenomenon (SP) in the high-precision solution of ordinary differential equations (ODEs) also widely exists in high-performance neural networks (NN). We show that the SAM is also a stiffness-aware step size adaptor that can enhance the model's representational ability to measure intrinsic SP.
arXiv Detail & Related papers (2023-08-19T08:17:41Z)
YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-time Object Detection [80.11152626362109]
We provide an efficient and performant object detector, termed YOLO-MS. We train our YOLO-MS on the MS COCO dataset from scratch without relying on any other large-scale datasets. Our work can also be used as a plug-and-play module for other YOLO models.
arXiv Detail & Related papers (2023-08-10T10:12:27Z)
GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds [72.60362979456035]
Masked Autoencoders (MAE) are challenging to explore in large-scale 3D point clouds. We propose a Generative Decoder for MAE (GD-MAE) to automatically merge the surrounding context. We demonstrate the efficacy of the proposed method on several large-scale benchmarks, including KITTI and ONCE.
arXiv Detail & Related papers (2022-12-06T14:32:55Z)
DAMO-YOLO : A Report on Real-Time Object Detection Design [19.06518351354291]
We present a fast and accurate object detection method dubbed DAMO-YOLO, which achieves higher performance than the state-of-the-art YOLO series. We use MAE-NAS, a method guided by the principle of maximum entropy, to search our detection backbone. In the design of necks and heads, we follow the rule of "large neck, small head".
arXiv Detail & Related papers (2022-11-23T17:59:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.