Dimba: Transformer-Mamba Diffusion Models
- URL: http://arxiv.org/abs/2406.01159v1
- Date: Mon, 3 Jun 2024 09:51:59 GMT
- Title: Dimba: Transformer-Mamba Diffusion Models
- Authors: Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Youqiang Zhang, Junshi Huang,
- Abstract summary: This paper unveils Dimba, a new text-to-image diffusion model that employs a distinctive hybrid architecture combining Transformer and Mamba elements.
Extensive experiments indicate that Dimba achieves comparable performance compared with benchmarks in terms of image quality, artistic rendering, and semantic control.
- Score: 32.04949173308355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper unveils Dimba, a new text-to-image diffusion model that employs a distinctive hybrid architecture combining Transformer and Mamba elements. Specifically, Dimba sequentially stacked blocks alternate between Transformer and Mamba layers, and integrate conditional information through the cross-attention layer, thus capitalizing on the advantages of both architectural paradigms. We investigate several optimization strategies, including quality tuning, resolution adaption, and identify critical configurations necessary for large-scale image generation. The model's flexible design supports scenarios that cater to specific resource constraints and objectives. When scaled appropriately, Dimba offers substantial throughput and a reduced memory footprint relative to conventional pure Transformers-based benchmarks. Extensive experiments indicate that Dimba achieves comparable performance compared with benchmarks in terms of image quality, artistic rendering, and semantic control. We also report several intriguing properties of architecture discovered during evaluation and release checkpoints in experiments. Our findings emphasize the promise of large-scale hybrid Transformer-Mamba architectures in the foundational stage of diffusion models, suggesting a bright future for text-to-image generation.
Related papers
- Adaptive Multi Scale Document Binarisation Using Vision Mamba [0.0]
We propose a Mamba-based architecture for document binarisation, which efficiently handles long sequences by scaling linearly and optimizing memory usage.
We introduce novel modifications to the skip connections by incorporating Difference of Gaussians (DoG) features, inspired by conventional signal processing techniques.
arXiv Detail & Related papers (2024-10-30T08:43:18Z) - MaskMamba: A Hybrid Mamba-Transformer Model for Masked Image Generation [63.73137438677585]
MaskMamba is a novel hybrid model that combines Mamba and Transformer architectures.
It achieves a remarkable $54.44%$ improvement in inference speed at a resolution of $2048times 2048$ over Transformer.
arXiv Detail & Related papers (2024-09-30T04:28:55Z) - Distillation-free Scaling of Large SSMs for Images and Videos [27.604572990625144]
State-space models (SSMs) have introduced a novel context modeling method by integrating state-space techniques into deep learning.
Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications.
We propose a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance.
arXiv Detail & Related papers (2024-09-18T10:48:10Z) - Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that is able to distill a pretrained Transformer architecture into alternative architectures such as state space models (SSMs)
Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens.
Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
arXiv Detail & Related papers (2024-08-19T17:48:11Z) - MambaVision: A Hybrid Mamba-Transformer Vision Backbone [54.965143338206644]
We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications.
Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features.
We conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba.
arXiv Detail & Related papers (2024-07-10T23:02:45Z) - Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and
Latent Diffusion [50.59261592343479]
We present Kandinsky1, a novel exploration of latent diffusion architecture.
The proposed model is trained separately to map text embeddings to image embeddings of CLIP.
We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting.
arXiv Detail & Related papers (2023-10-05T12:29:41Z) - Transformers For Recognition In Overhead Imagery: A Reality Check [0.0]
We compare the impact of adding transformer structures into state-of-the-art segmentation models for overhead imagery.
Our results suggest that transformers provide consistent, but modest, performance improvements.
arXiv Detail & Related papers (2022-10-23T02:17:31Z) - MAT: Mask-Aware Transformer for Large Hole Image Inpainting [79.67039090195527]
We present a novel model for large hole inpainting, which unifies the merits of transformers and convolutions.
Experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets.
arXiv Detail & Related papers (2022-03-29T06:36:17Z) - CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.