MxT: Mamba x Transformer for Image Inpainting
- URL: http://arxiv.org/abs/2407.16126v3
- Date: Thu, 15 Aug 2024 21:11:10 GMT
- Title: MxT: Mamba x Transformer for Image Inpainting
- Authors: Shuang Chen, Amir Atapour-Abarghouei, Haozheng Zhang, Hubert P. H. Shum
- Abstract summary: Image inpainting aims to restore missing or damaged regions of images with semantically coherent content.
We introduce MxT, composed of the proposed Hybrid Module (HM), which combines Mamba with the transformer in a synergistic manner.
Our HM facilitates dual-level interaction learning at both the pixel and patch levels, greatly enhancing the model's ability to reconstruct images with high quality and contextual accuracy.
- Score: 11.447968918063335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image inpainting, or image completion, is a crucial task in computer vision that aims to restore missing or damaged regions of images with semantically coherent content. This technique requires a precise balance of local texture replication and global contextual understanding, so that the restored image integrates seamlessly with its surroundings. Traditional methods using Convolutional Neural Networks (CNNs) are effective at capturing local patterns but often struggle with broader contextual relationships due to their limited receptive fields. Recent advances have incorporated transformers, leveraging their ability to model global interactions; however, these methods face computational inefficiencies and struggle to preserve fine-grained detail. To overcome these challenges, we introduce MxT, composed of the proposed Hybrid Module (HM), which combines Mamba with the transformer in a synergistic manner. Mamba efficiently processes long sequences at linear computational cost, making it an ideal complement to the transformer for handling long-range interactions. Our HM facilitates dual-level interaction learning at both the pixel and patch levels, greatly enhancing the model's ability to reconstruct images with high quality and contextual accuracy. We evaluate MxT on the widely used CelebA-HQ and Places2-standard datasets, where it consistently outperforms existing state-of-the-art methods. The code will be released at https://github.com/ChrisChen1023/MxT.
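To make the dual-level design concrete, here is a minimal Python sketch of a hybrid module that runs a toy linear-recurrence (Mamba-style) scan over pixel tokens and softmax self-attention over patch tokens, then fuses the two streams. The recurrence, pooling, and additive fusion below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ssm_scan(x, a=0.9):
    """Toy linear recurrence h[t] = a*h[t-1] + x[t], standing in for
    Mamba's selective scan; cost is O(L) in sequence length L."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(len(x)):
        h = a * h + x[t]
        out[t] = h
    return out

def attention(q, k, v):
    """Standard softmax self-attention over patch tokens (O(L^2))."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def hybrid_module(pixels, patches):
    """Dual-level interaction: scan at the pixel level, attention at the
    patch level; pixel features are pooled per patch and added
    (the fusion scheme here is an assumption)."""
    pixel_feats = ssm_scan(pixels)                      # (Lp, C) pixel tokens
    patch_feats = attention(patches, patches, patches)  # (Np, C) patch tokens
    per_patch = pixel_feats.reshape(len(patches), -1, pixels.shape[-1]).mean(axis=1)
    return patch_feats + per_patch

# Usage: 64 pixel tokens grouped into 4 patches of 16, feature dim 8.
pixels, patches = np.random.randn(64, 8), np.random.randn(4, 8)
print(hybrid_module(pixels, patches).shape)  # (4, 8)
```

The pairing targets complexity: the O(L) scan handles the long pixel sequence, while quadratic attention is reserved for the much shorter patch sequence.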
Related papers
- A Lightweight and Effective Image Tampering Localization Network with Vision Mamba [5.369780585789917]
Current image tampering localization methods rely on Convolutional Neural Networks (CNNs) and Transformers.
We propose a lightweight and effective FORensic network based on vision MAmba (ForMa) for blind image tampering localization.
arXiv Detail & Related papers (2025-02-14T06:35:44Z)
- MatIR: A Hybrid Mamba-Transformer Image Restoration Model [95.17418386046054]
We propose a Mamba-Transformer hybrid image restoration model called MatIR.
MatIR cross-cycles between Transformer-layer blocks and Mamba-layer blocks to extract features.
In the Mamba module, we introduce the Image Restoration State Space (IRSS) module, which traverses the feature map along four scan paths (illustrated in the sketch below).
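For intuition, a small sketch of how a feature map can be flattened along four scan paths before a Mamba-style scan is applied; the exact IRSS orderings and merge rule are not given in this summary, so the choices below are assumptions.

```python
import numpy as np

def four_scan_paths(feat):
    """Flatten an (H, W, C) feature map into four 1-D scan orders:
    row-major forward/backward and column-major forward/backward.
    Running a Mamba-style scan over each gives every location context
    from all four directions; outputs would then be un-permuted and
    merged (merge scheme is an assumption)."""
    H, W, C = feat.shape
    rows = feat.reshape(H * W, C)                     # left-to-right, top-down
    cols = feat.transpose(1, 0, 2).reshape(H * W, C)  # top-down, left-to-right
    return [rows, rows[::-1], cols, cols[::-1]]

paths = four_scan_paths(np.random.randn(8, 8, 4))
print([p.shape for p in paths])  # four (64, 4) sequences
```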
arXiv Detail & Related papers (2025-01-30T14:55:40Z)
- Contrast: A Hybrid Architecture of Transformers and State Space Models for Low-Level Vision [3.574664325523221]
We propose Contrast, a hybrid SR model that combines Convolutional, Transformer, and State Space components.
By integrating transformer and state space mechanisms, Contrast compensates for the shortcomings of each approach, enhancing both global context modeling and pixel-level accuracy.
arXiv Detail & Related papers (2025-01-23T03:34:14Z)
- DiMSUM: Diffusion Mamba -- A Scalable and Unified Spatial-Frequency Method for Image Generation [4.391439322050918]
We introduce a novel state-space architecture for diffusion models.
We harness spatial and frequency information to enhance the inductive bias towards local features in input images.
arXiv Detail & Related papers (2024-11-06T18:59:17Z)
- MaskMamba: A Hybrid Mamba-Transformer Model for Masked Image Generation [63.73137438677585]
MaskMamba is a novel hybrid model that combines Mamba and Transformer architectures.
It achieves a remarkable 54.44% improvement in inference speed over Transformer at a resolution of 2048×2048.
arXiv Detail & Related papers (2024-09-30T04:28:55Z)
- Cross-Scan Mamba with Masked Training for Robust Spectral Imaging [51.557804095896174]
We propose the Cross-Scanning Mamba, named CS-Mamba, which employs a Spatial-Spectral SSM for global-local balanced context encoding.
Experimental results show that our CS-Mamba achieves state-of-the-art performance, and that the masked training method better reconstructs smooth features, improving visual quality.
arXiv Detail & Related papers (2024-08-01T15:14:10Z)
- PixMamba: Leveraging State Space Models in a Dual-Level Architecture for Underwater Image Enhancement [7.443057703389351]
Underwater Image Enhancement (UIE) is critical for marine research and exploration but is hindered by complex color distortions and severe blurring.
Recent deep learning-based methods have achieved remarkable results, yet they struggle with high computational costs and insufficient global modeling.
We present PixMamba, a novel architecture designed to overcome these challenges by leveraging State Space Models (SSMs) for efficient global dependency modeling.
arXiv Detail & Related papers (2024-06-12T17:34:38Z)
- Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on a Distance-based Weighted Transformer (DWT) to better capture the relationships between an image's components (one possible weighting scheme is sketched after this summary).
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
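One plausible reading of distance-based weighting, sketched below as an assumption rather than the paper's exact formulation: bias the attention logits by the spatial distance between patch positions, so nearby components are weighted more heavily.

```python
import numpy as np

def distance_weighted_attention(tokens, coords, tau=4.0):
    """Self-attention with logits biased by spatial distance (-dist/tau).
    Illustrative only; the actual DWT weighting may differ."""
    logits = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    logits -= dist / tau  # penalize far-apart token pairs
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ tokens

# Usage: 16 patch tokens on a 4x4 grid, feature dim 8.
coords = np.stack(np.meshgrid(np.arange(4.0), np.arange(4.0)), -1).reshape(16, 2)
tokens = np.random.randn(16, 8)
print(distance_weighted_attention(tokens, coords).shape)  # (16, 8)
```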
arXiv Detail & Related papers (2023-10-11T12:46:11Z)
- T-former: An Efficient Transformer for Image Inpainting [50.43302925662507]
A class of attention-based network architectures, called transformers, has shown significant performance in natural language processing.
In this paper, we design a novel attention mechanism whose complexity is linear in the image resolution, derived via Taylor expansion; based on this attention, a network called T-former is designed for image inpainting (see the sketch below).
Experiments on several benchmark datasets demonstrate that our proposed method achieves state-of-the-art accuracy while keeping the number of parameters and the computational complexity relatively low.
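For intuition, here is a hedged sketch of linear attention obtained from a first-order Taylor expansion of the softmax kernel, exp(q·k) ≈ 1 + q·k; with l2-normalized queries and keys the weights stay non-negative, and reassociating (QKᵀ)V as Q(KᵀV) makes the cost linear in the number of tokens. The exact T-former formulation may differ.

```python
import numpy as np

def taylor_linear_attention(q, k, v, eps=1e-6):
    """Linear attention via exp(q.k) ~ 1 + q.k.
    Cost is O(N d^2) instead of O(N^2 d): the (d, d) summary k.T @ v is
    built once and shared by every query. Sketch only."""
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)  # keep 1 + q.k >= 0
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    num = v.sum(axis=0) + q @ (k.T @ v)   # numerator of the weighted average
    den = k.shape[0] + q @ k.sum(axis=0)  # matching per-query normalizer
    return num / den[:, None]

q, k, v = (np.random.randn(64, 8) for _ in range(3))
print(taylor_linear_attention(q, k, v).shape)  # (64, 8)
```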
arXiv Detail & Related papers (2023-05-12T04:10:42Z)
- MAT: Mask-Aware Transformer for Large Hole Image Inpainting [79.67039090195527]
We present a novel model for large hole inpainting, which unifies the merits of transformers and convolutions.
Experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets.
arXiv Detail & Related papers (2022-03-29T06:36:17Z)