MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders
- URL: http://arxiv.org/abs/2407.02228v2
- Date: Sun, 14 Jul 2024 07:50:04 GMT
- Title: MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders
- Authors: Baijiong Lin, Weisen Jiang, Pengguang Chen, Yu Zhang, Shu Liu, Ying-Cong Chen,
- Abstract summary: We propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding.
Experiments on NYUDv2 and PASCAL-Context datasets demonstrate the superior performance of MTMamba over Transformer-based and CNN-based methods.
- Score: 27.487314321249627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-task dense scene understanding, which learns a model for multiple dense prediction tasks, has a wide range of application scenarios. Modeling long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. It contains two types of core blocks: self-task Mamba (STM) block and cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging Mamba, while CTM explicitly models task interactions to facilitate information exchange across tasks. Experiments on NYUDv2 and PASCAL-Context datasets demonstrate the superior performance of MTMamba over Transformer-based and CNN-based methods. Notably, on the PASCAL-Context dataset, MTMamba achieves improvements of +2.08, +5.01, and +4.90 over the previous best methods in the tasks of semantic segmentation, human parsing, and object boundary detection, respectively. The code is available at https://github.com/EnVision-Research/MTMamba.
Related papers
- MambaVision: A Hybrid Mamba-Transformer Vision Backbone [54.965143338206644]
We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications.
Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features.
We conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba.
arXiv Detail & Related papers (2024-07-10T23:02:45Z) - CDMamba: Remote Sensing Image Change Detection with Mamba [30.387208446303944]
We propose a model called CDMamba, which effectively combines global and local features for handling CD tasks.
Specifically, the Scaled Residual ConvMamba block is proposed to utilize the ability of Mamba to extract global features and convolution to enhance the local details.
arXiv Detail & Related papers (2024-06-06T16:04:30Z) - MambaOut: Do We Really Need Mamba for Vision? [70.60495392198686]
Mamba, an architecture with RNN-like token mixer of state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism.
This paper conceptually concludes that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics.
We construct a series of models named MambaOut through stacking Mamba blocks while removing their core token mixer, SSM.
arXiv Detail & Related papers (2024-05-13T17:59:56Z) - Visual Mamba: A Survey and New Outlooks [33.90213491829634]
Mamba, a recent selective structured state space model, excels in long sequence modeling.
Since January 2024, Mamba has been actively applied to diverse computer vision tasks.
This paper reviews visual Mamba approaches, analyzing over 200 papers.
arXiv Detail & Related papers (2024-04-29T16:51:30Z) - RankMamba: Benchmarking Mamba's Document Ranking Performance in the Era of Transformers [2.8554857235549753]
Transformer architecture's core mechanism -- attention requires $O(n2)$ time complexity in training and $O(n)$ time complexity in inference.
A notable model structure -- Mamba, which is based on state space models, has achieved transformer-equivalent performance in sequence modeling tasks.
We find that Mamba models achieve competitive performance compared to transformer-based models with the same training recipe.
arXiv Detail & Related papers (2024-03-27T06:07:05Z) - ReMamber: Referring Image Segmentation with Mamba Twister [51.291487576255435]
ReMamber is a novel RIS architecture that integrates the power of Mamba with a multi-modal Mamba Twister block.
The Mamba Twister explicitly models image-text interaction, and fuses textual and visual features through its unique channel and spatial twisting mechanism.
arXiv Detail & Related papers (2024-03-26T16:27:37Z) - MiM-ISTD: Mamba-in-Mamba for Efficient Infrared Small Target Detection [72.46396769642787]
We develop a nested structure, Mamba-in-Mamba (MiM-ISTD), for efficient infrared small target detection.
MiM-ISTD is $8 times$ faster than the SOTA method and reduces GPU memory usage by 62.2$%$ when testing on $2048 times 2048$ images.
arXiv Detail & Related papers (2024-03-04T15:57:29Z) - Is Mamba Capable of In-Context Learning? [63.682741783013306]
State of the art foundation models such as GPT-4 perform surprisingly well at in-context learning (ICL)
This work provides empirical evidence that Mamba, a newly proposed state space model, has similar ICL capabilities.
arXiv Detail & Related papers (2024-02-05T16:39:12Z) - SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image
Segmentation [17.676472608152704]
We introduce SegMamba, a novel 3D medical image textbfSegmentation textbfMamba model.
SegMamba excels in whole volume feature modeling from a state space model standpoint.
Experiments on the BraTS2023 dataset demonstrate the effectiveness and efficiency of our SegMamba.
arXiv Detail & Related papers (2024-01-24T16:17:23Z) - MTI-Net: Multi-Scale Task Interaction Networks for Multi-Task Learning [82.62433731378455]
We show that tasks with high affinity at a certain scale are not guaranteed to retain this behaviour at other scales.
We propose a novel architecture, namely MTI-Net, that builds upon this finding.
arXiv Detail & Related papers (2020-01-19T21:02:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.