Mamba YOLO: SSMs-Based YOLO For Object Detection
- URL: http://arxiv.org/abs/2406.05835v1
- Date: Sun, 9 Jun 2024 15:56:19 GMT
- Title: Mamba YOLO: SSMs-Based YOLO For Object Detection
- Authors: Zeyu Wang, Chen Li, Huiying Xu, Xinzhong Zhu
- Abstract summary: Mamba-YOLO is a novel object detection model based on State Space Models.
We show that Mamba-YOLO surpasses the existing YOLO series models in both performance and competitiveness.
- Score: 9.879086222226617
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Propelled by the rapid advancement of deep learning technologies, the YOLO series has set a new benchmark for real-time object detectors. Researchers have continuously explored innovative applications of reparameterization, efficient layer aggregation networks, and anchor-free techniques on the foundation of YOLO. To further enhance detection performance, Transformer-based structures have been introduced, significantly expanding the model's receptive field and achieving notable performance gains. However, such improvements come at a cost, as the quadratic complexity of the self-attention mechanism increases the computational burden of the model. Fortunately, the emergence of State Space Models (SSM) as an innovative technology has effectively mitigated the issues caused by quadratic complexity. In light of these advancements, we introduce Mamba-YOLO, a novel object detection model based on SSM. Mamba-YOLO not only optimizes the SSM foundation but also adapts specifically for object detection tasks. Given the potential limitations of SSM in sequence modeling, such as insufficient receptive field and weak image locality, we have designed the LSBlock and RGBlock. These modules enable more precise capture of local image dependencies and significantly enhance the robustness of the model. Extensive experimental results on the publicly available benchmark datasets COCO and VOC demonstrate that Mamba-YOLO surpasses the existing YOLO series models in both performance and competitiveness, showcasing its substantial potential and competitive edge. The PyTorch code is available at: https://github.com/HZAI-ZJNU/Mamba-YOLO
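The complexity contrast drawn in the abstract can be illustrated with a toy example. The sketch below is not the Mamba-YOLO implementation and is not taken from the linked repository. It assumes a scalar, time-invariant state space model with hypothetical parameters `a`, `b`, `c`, and the function names `ssm_scan` and `attention_pair_count` are illustrative only. It shows that a recurrent SSM pass visits each sequence element once, while full self-attention scores every pair of elements.

```python
def ssm_scan(u, a, b, c):
    """Run a scalar discrete state space model over the sequence u.

    Recurrence (illustrative, time-invariant):
        h_t = a * h_{t-1} + b * u_t
        y_t = c * h_t
    One state update per element, so cost grows linearly with len(u).
    """
    h = 0.0
    ys = []
    for x in u:
        h = a * h + b * x   # update the hidden state
        ys.append(c * h)    # read out the output for this step
    return ys


def attention_pair_count(n):
    """Number of query-key scores full self-attention computes for n tokens."""
    return n * n


if __name__ == "__main__":
    seq = [1.0, 0.0, 0.0]
    print(ssm_scan(seq, a=0.5, b=1.0, c=2.0))
    print(len(seq), "SSM steps vs", attention_pair_count(len(seq)), "attention scores")
```

Mamba-style models make the SSM parameters input-dependent and scan 2D feature maps along several directions. Those details are omitted here.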
Related papers
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z) - Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection [18.65107742085838]
We present Mamba-YOLO-World, a novel YOLO-based OVD model employing the proposed MambaFusion Path Aggregation Network (MambaFusion-PAN) as its neck architecture.
Specifically, we introduce an innovative State Space Model-based feature fusion mechanism consisting of a Parallel-Guided Selective Scan algorithm and a Serial-Guided Selective Scan algorithm with linear complexity and globally guided receptive fields.
Experiments demonstrate that our model outperforms the original YOLO-World on the COCO and LVIS benchmarks in both zero-shot and fine-tuning settings.
arXiv Detail & Related papers (2024-09-13T03:23:52Z) - MambaVT: Spatio-Temporal Contextual Modeling for robust RGB-T Tracking [51.28485682954006]
We propose a pure Mamba-based framework (MambaVT) to fully exploit intrinsic spatio-temporal contextual modeling for robust visible-thermal tracking.
Specifically, we devise the long-range cross-frame integration component to globally adapt to target appearance variations.
Experiments show the significant potential of vision Mamba for RGB-T tracking, with MambaVT achieving state-of-the-art performance on four mainstream benchmarks.
arXiv Detail & Related papers (2024-08-15T02:29:00Z) - Hyper-YOLO: When Visual Object Detection Meets Hypergraph Computation [74.65906322148997]
We introduce a new object detection method that integrates hypergraph computations to capture the complex high-order correlations among visual features.
Hyper-YOLO significantly outperforms the advanced YOLOv8-N and YOLOv9-T with 12% $\text{AP}^{val}$ and 9% $\text{AP}^{val}$ improvements.
arXiv Detail & Related papers (2024-08-09T01:21:15Z) - Spatial Transformer Network YOLO Model for Agricultural Object Detection [0.3124884279860061]
We propose a new method that integrates spatial transformer networks (STNs) into YOLO to improve performance.
The proposed STN-YOLO aims to enhance the model's effectiveness by focusing on important areas of the image.
We apply the STN-YOLO on benchmark datasets for Agricultural object detection as well as a new dataset from a state-of-the-art plant phenotyping greenhouse facility.
arXiv Detail & Related papers (2024-07-31T14:53:41Z) - YOLOv10: Real-Time End-to-End Object Detection [68.28699631793967]
YOLOs have emerged as the predominant paradigm in the field of real-time object detection.
The reliance on the non-maximum suppression (NMS) for post-processing hampers the end-to-end deployment of YOLOs.
We introduce the holistic efficiency-accuracy driven model design strategy for YOLOs.
arXiv Detail & Related papers (2024-05-23T11:44:29Z) - FER-YOLO-Mamba: Facial Expression Detection and Classification Based on Selective State Space [9.68374853606234]
This paper presents the FER-YOLO-Mamba model, which integrates the principles of Mamba and YOLO technologies.
Within the FER-YOLO-Mamba model, we further devise a FER-YOLO-VSS dual-branch module, which combines the inherent strengths of convolutional layers in local feature extraction.
To the best of our knowledge, this is the first Vision Mamba model designed for facial expression detection and classification.
arXiv Detail & Related papers (2024-05-03T03:20:37Z) - VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting [11.058879849373572]
Combining ViTs or CNNs with RNNs for spatiotemporal forecasting has yielded unparalleled results in predicting temporal and spatial dynamics.
The recent Mamba-based architecture has been met with enthusiasm for its exceptional long-sequence modeling capabilities.
We propose the VMRNN cell, a recurrent unit that integrates the strengths of Vision Mamba blocks with LSTM.
arXiv Detail & Related papers (2024-03-25T08:26:42Z) - Understanding Self-attention Mechanism via Dynamical System Perspective [58.024376086269015]
Self-attention mechanism (SAM) is widely used in various fields of artificial intelligence.
We show that the intrinsic stiffness phenomenon (SP) in the high-precision solution of ordinary differential equations (ODEs) also widely exists in high-performance neural networks (NN).
We show that the SAM is also a stiffness-aware step size adaptor that can enhance the model's representational ability to measure intrinsic SP.
arXiv Detail & Related papers (2023-08-19T08:17:41Z) - YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-time Object Detection [80.11152626362109]
We provide an efficient and performant object detector, termed YOLO-MS.
We train our YOLO-MS on the MS COCO dataset from scratch without relying on any other large-scale datasets.
Our work can also be used as a plug-and-play module for other YOLO models.
arXiv Detail & Related papers (2023-08-10T10:12:27Z) - DAMO-YOLO : A Report on Real-Time Object Detection Design [19.06518351354291]
We present a fast and accurate object detection method dubbed DAMO-YOLO, which achieves higher performance than the state-of-the-art YOLO series.
We use MAE-NAS, a method guided by the principle of maximum entropy, to search our detection backbone.
In the design of necks and heads, we follow the rule of "large neck, small head".
arXiv Detail & Related papers (2022-11-23T17:59:12Z)