Dual Vision Transformer
- URL: http://arxiv.org/abs/2207.04976v2
- Date: Tue, 12 Jul 2022 08:26:22 GMT
- Title: Dual Vision Transformer
- Authors: Ting Yao and Yehao Li and Yingwei Pan and Yu Wang and Xiao-Ping Zhang
and Tao Mei
- Abstract summary: We propose a novel Transformer architecture, named Dual Vision Transformer (Dual-ViT), that aims to mitigate the cost issue.
The new architecture incorporates a critical semantic pathway that can more efficiently compress token vectors into global semantics with reduced order of complexity.
We empirically demonstrate that Dual-ViT achieves higher accuracy than SOTA Transformer architectures with reduced training complexity.
- Score: 114.1062057736447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior works have proposed several strategies to reduce the computational cost
of self-attention mechanism. Many of these works consider decomposing the
self-attention procedure into regional and local feature extraction procedures
that each incur a much smaller computational complexity. However, regional
information is typically achieved only at the expense of undesirable
information loss owing to down-sampling. In this paper, we propose a novel
Transformer architecture that aims to mitigate the cost issue, named Dual
Vision Transformer (Dual-ViT). The new architecture incorporates a critical
semantic pathway that can more efficiently compress token vectors into global
semantics with reduced order of complexity. Such compressed global semantics
then serve as useful prior information in learning finer pixel level details,
through another constructed pixel pathway. The semantic pathway and pixel
pathway are then integrated together and are jointly trained, spreading the
enhanced self-attention information in parallel through both of the pathways.
Dual-ViT is hence able to reduce the computational complexity without
compromising much accuracy. We empirically demonstrate that Dual-ViT achieves
higher accuracy than SOTA Transformer architectures with reduced training
complexity. Source code is available at
\url{https://github.com/YehLi/ImageNetModel}.
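As a rough illustration (not the authors' implementation), the two-pathway idea — compressing the full token sequence into a few global semantic tokens, then letting the pixel pathway attend over that compressed prior instead of over all tokens — can be sketched with plain NumPy attention. The token counts, the number of semantic tokens, and the random semantic queries below are all assumptions for the sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: (n_q, d) queries over (n_k, d) keys/values.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
n_tokens, d, n_sem = 196, 64, 8  # e.g. 14x14 pixel tokens, 8 semantic tokens (assumed sizes)

x = rng.standard_normal((n_tokens, d))    # pixel-level tokens
sem = rng.standard_normal((n_sem, d))     # stand-in for learnable semantic queries

# Semantic pathway: a handful of semantic tokens attend over all pixel tokens,
# compressing them into global semantics at O(n_sem * n_tokens) cost.
global_sem = attention(sem, x, x)         # shape (n_sem, d)

# Pixel pathway: each pixel token attends only over the compressed semantics
# (the "prior"), avoiding the full O(n_tokens^2) self-attention.
x_refined = x + attention(x, global_sem, global_sem)

print(x_refined.shape)  # (196, 64)
```

Because `n_sem` is small and fixed, both attention calls scale linearly in the number of pixel tokens, which is the complexity reduction the abstract describes; the real Dual-ViT additionally trains both pathways jointly and in parallel.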
Related papers
- Magic ELF: Image Deraining Meets Association Learning and Transformer [63.761812092934576]
This paper aims to unify CNN and Transformer to take advantage of their learning merits for image deraining.
A novel multi-input attention module (MAM) is proposed to associate rain removal and background recovery.
Our proposed method (dubbed ELF) outperforms the state-of-the-art approach (MPRNet) by 0.25 dB on average.
arXiv Detail & Related papers (2022-07-21T12:50:54Z)
- MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and local spatial-detailed features.
Our MISSU achieves the best performance over previous state-of-the-art methods.
arXiv Detail & Related papers (2022-06-02T07:38:53Z)
- Lightweight Bimodal Network for Single-Image Super-Resolution via Symmetric CNN and Recursive Transformer [27.51790638626891]
Single-image super-resolution (SISR) has achieved significant breakthroughs with the development of deep learning.
We propose a Lightweight Bimodal Network (LBNet) for SISR.
Specifically, an effective Symmetric CNN is designed for local feature extraction and coarse image reconstruction.
arXiv Detail & Related papers (2022-04-28T04:43:22Z)
- Unleashing the Power of Transformer for Graphs [28.750700720796836]
Transformer suffers from a scalability problem when dealing with graphs.
We propose a new Transformer architecture, named dual-encoding Transformer (DET).
DET has a structural encoder to aggregate information from connected neighbors and a semantic encoder to focus on semantically useful distant nodes.
arXiv Detail & Related papers (2022-02-18T06:40:51Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information from CNNs with the global context provided by Transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated Transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation [6.646135062704341]
The Transformer architecture has been successful in a number of natural language processing tasks.
We present UTNet, a powerful hybrid Transformer architecture that integrates self-attention into a convolutional neural network for enhancing medical image segmentation.
arXiv Detail & Related papers (2021-07-02T00:56:27Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and Transformers.
This is the first paper to apply Transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- Multi-Stage Progressive Image Restoration [167.6852235432918]
We propose a novel synergistic design that can optimally balance these competing goals.
Our main proposal is a multi-stage architecture that progressively learns restoration functions for the degraded inputs.
The resulting tightly interlinked multi-stage architecture, named MPRNet, delivers strong performance gains on ten datasets.
arXiv Detail & Related papers (2021-02-04T18:57:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.