MB-TaylorFormer: Multi-branch Efficient Transformer Expanded by Taylor Formula for Image Dehazing
- URL: http://arxiv.org/abs/2308.14036v2
- Date: Wed, 30 Aug 2023 13:27:35 GMT
- Title: MB-TaylorFormer: Multi-branch Efficient Transformer Expanded by Taylor Formula for Image Dehazing
- Authors: Yuwei Qiu, Kaihao Zhang, Chenxi Wang, Wenhan Luo, Hongdong Li and Zhi Jin
- Abstract summary: Transformer networks are beginning to replace pure convolutional neural networks (CNNs) in the field of computer vision.
We propose a new Transformer variant, which applies the Taylor expansion to approximate the softmax-attention and achieves linear computational complexity.
We introduce a multi-branch architecture with multi-scale patch embedding to the proposed Transformer, which embeds features by overlapping deformable convolution of different scales.
Our model, named Multi-branch Transformer expanded by Taylor formula (MB-TaylorFormer), can embed coarse-to-fine features more flexibly at the patch embedding stage and capture long-distance pixel interactions with limited computational cost.
- Score: 88.61523825903998
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, Transformer networks have begun to replace pure
convolutional neural networks (CNNs) in the field of computer vision due to
their global receptive field and adaptability to input. However, the quadratic
computational complexity of softmax-attention limits its wide application to
the image dehazing task, especially for high-resolution images. To address this
issue, we propose a new Transformer variant, which applies the Taylor expansion
to approximate the softmax-attention and achieves linear computational
complexity. A multi-scale attention refinement module is proposed as a
complement to correct the error of the Taylor expansion. Furthermore, we
introduce a multi-branch architecture with multi-scale patch embedding to the
proposed Transformer, which embeds features by overlapping deformable
convolution of different scales. The design of multi-scale patch embedding is
based on three key ideas: 1) various sizes of the receptive field; 2)
multi-level semantic information; 3) flexible shapes of the receptive field.
Our model, named Multi-branch Transformer expanded by Taylor formula
(MB-TaylorFormer), can embed coarse-to-fine features more flexibly at the patch
embedding stage and capture long-distance pixel interactions with limited
computational cost. Experimental results on several dehazing benchmarks show
that MB-TaylorFormer achieves state-of-the-art (SOTA) performance with a light
computational burden. The source code and pre-trained models are available at
https://github.com/FVL2020/ICCV-2023-MB-TaylorFormer.
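To make the linear-complexity claim concrete, the first-order Taylor view of softmax-attention can be written out as follows (a sketch of the general idea, not necessarily the paper's exact formulation; normalization details may differ). Expanding $e^x \approx 1 + x$ around zero and regrouping the sums gives

```latex
\mathrm{Attn}(Q,K,V)_i
  = \frac{\sum_{j=1}^{N} e^{q_i^{\top} k_j}\, v_j}{\sum_{j=1}^{N} e^{q_i^{\top} k_j}}
  \approx \frac{\sum_{j} \bigl(1 + q_i^{\top} k_j\bigr)\, v_j}
               {\sum_{j} \bigl(1 + q_i^{\top} k_j\bigr)}
  = \frac{\sum_{j} v_j \;+\; q_i^{\top} \sum_{j} k_j v_j^{\top}}
         {N \;+\; q_i^{\top} \sum_{j} k_j}.
```

The three sums over $j$ do not depend on the query index $i$, so they are computed once and reused for every query: the cost drops from $O(N^2 d)$ to $O(N d^2)$. The discarded second-order remainder is precisely the error that the multi-scale attention refinement module is introduced to correct.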
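A minimal PyTorch sketch of this linearized attention is given below. It is a hypothetical illustration rather than the authors' released code: the tensor layout, the L2 normalization of queries and keys (used here to keep the first-order remainder small), and the omission of the refinement module are all assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def taylor_attention(q, k, v):
    """First-order Taylor approximation of softmax-attention.

    q, k, v: (batch, heads, n, d). Replacing exp(q.k) with 1 + q.k lets the
    key/value sums be shared across queries: O(n*d^2) instead of O(n^2*d).
    """
    # Normalizing q and k bounds q.k in [-1, 1], keeping the dropped
    # second-order remainder modest (an assumption of this sketch; the paper
    # additionally corrects the residual error with a refinement module).
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    n = k.shape[-2]
    kv = torch.einsum('bhnd,bhne->bhde', k, v)       # sum_j k_j v_j^T
    k_sum = k.sum(dim=-2)                            # sum_j k_j
    num = v.sum(dim=-2, keepdim=True) + torch.einsum('bhnd,bhde->bhne', q, kv)
    den = n + torch.einsum('bhnd,bhd->bhn', q, k_sum).unsqueeze(-1)
    return num / den

out = taylor_attention(*[torch.randn(1, 2, 4096, 32) for _ in range(3)])
```

At 4096 tokens (a 64x64 feature map) exact softmax-attention would materialize a 4096x4096 score matrix per head; the sketch above never forms it.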
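The multi-branch, multi-scale patch embedding can be sketched in the same spirit. The fragment below is likewise a hypothetical illustration, not the released implementation: it pairs torchvision's DeformConv2d with a small offset-predicting convolution, and the kernel sizes, embedding width, and stride are illustrative choices.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformablePatchEmbed(nn.Module):
    """One branch: overlapping deformable convolution at a fixed kernel scale."""
    def __init__(self, in_ch, embed_dim, kernel_size, stride=1):
        super().__init__()
        pad = kernel_size // 2  # "same"-style padding -> overlapping patches
        # Predict an (x, y) sampling offset per kernel tap and output position,
        # letting the receptive field deform to local image structure.
        self.offset = nn.Conv2d(in_ch, 2 * kernel_size * kernel_size,
                                kernel_size, stride=stride, padding=pad)
        self.deform = DeformConv2d(in_ch, embed_dim, kernel_size,
                                   stride=stride, padding=pad)

    def forward(self, x):
        return self.deform(x, self.offset(x))

class MultiScalePatchEmbed(nn.Module):
    """Parallel branches with different kernel sizes give coarse-to-fine tokens."""
    def __init__(self, in_ch=3, embed_dim=24, kernels=(3, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            DeformablePatchEmbed(in_ch, embed_dim, k) for k in kernels)

    def forward(self, x):
        # One feature map per scale; each would feed its own Transformer branch.
        return [branch(x) for branch in self.branches]

feats = MultiScalePatchEmbed()(torch.randn(1, 3, 64, 64))  # two 24-channel maps
```

Each branch sees the image at a different receptive-field size and shape, matching the three design ideas listed in the abstract; how the branch outputs are fused downstream is left out of this sketch.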
Related papers
- Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time [17.086679273053853]
We show that a novel fast approximation method can calculate the gradients in almost linear time.
By improving the efficiency of gradient computation, we hope that this work will facilitate more effective training and deployment of long-context language models.
arXiv Detail & Related papers (2024-08-23T17:16:43Z)
- T-former: An Efficient Transformer for Image Inpainting [50.43302925662507]
A class of attention-based network architectures, called Transformers, has shown significant performance in natural language processing.
In this paper, we design a novel attention mechanism, derived from a Taylor expansion, whose cost is linearly related to the resolution; based on this attention, a network called $T$-former is designed for image inpainting.
Experiments on several benchmark datasets demonstrate that our proposed method achieves state-of-the-art accuracy while maintaining a relatively low number of parameters and computational complexity.
arXiv Detail & Related papers (2023-05-12T04:10:42Z)
- Vision Transformer with Quadrangle Attention [76.35955924137986]
We propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation.
Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles.
We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and negligible extra computational cost.
arXiv Detail & Related papers (2023-03-27T11:13:50Z)
- Optimizing Vision Transformers for Medical Image Segmentation and Few-Shot Domain Adaptation [11.690799827071606]
We propose Convolutional Swin-Unet (CS-Unet) transformer blocks and optimise their settings with respect to patch embedding, projection, the feed-forward network, upsampling, and skip connections.
CS-Unet can be trained from scratch and inherits the superiority of convolutions in each feature-processing phase.
Experiments show that CS-Unet without pre-training surpasses other state-of-the-art counterparts by large margins on two medical CT and MRI datasets with fewer parameters.
arXiv Detail & Related papers (2022-10-14T19:18:52Z)
- Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing linear-complexity attention methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper that applies transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- Incorporating Convolution Designs into Visual Transformers [24.562955955312187]
We propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data and extra CNN teachers.
arXiv Detail & Related papers (2021-03-22T13:16:12Z)
- CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation [95.51455777713092]
Convolutional neural networks (CNNs) have been the de facto standard for 3D medical image segmentation.
We propose a novel framework that efficiently bridges a Convolutional neural network and a Transformer (CoTr) for accurate 3D medical image segmentation.
arXiv Detail & Related papers (2021-03-04T13:34:22Z)