Vision Transformers for Dense Prediction
- URL: http://arxiv.org/abs/2103.13413v1
- Date: Wed, 24 Mar 2021 18:01:17 GMT
- Title: Vision Transformers for Dense Prediction
- Authors: René Ranftl, Alexey Bochkovskiy, Vladlen Koltun
- Abstract summary: We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
- Score: 77.34726150561087
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art. Our models are available at https://github.com/intel-isl/DPT.
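The abstract describes reassembling tokens from several transformer stages into image-like feature maps and then fusing them with a convolutional decoder. The following is a minimal PyTorch sketch of that idea only, assuming (B, N, C) token sequences and a known patch grid; module names, layer choices, and hyperparameters are illustrative assumptions and do not reproduce the released DPT implementation in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Reassemble(nn.Module):
    # Turn a (B, N, C) token sequence from one transformer stage into a
    # 2-D, image-like feature map at a chosen resolution.
    def __init__(self, embed_dim, out_dim, scale):
        super().__init__()
        self.proj = nn.Conv2d(embed_dim, out_dim, kernel_size=1)
        self.scale = scale  # e.g. 4.0, 2.0, 1.0, 0.5 for four stages

    def forward(self, tokens, grid_hw):
        b, n, c = tokens.shape
        h, w = grid_hw                                   # ViT patch grid (class token already removed)
        x = tokens.transpose(1, 2).reshape(b, c, h, w)   # tokens -> spatial map
        x = self.proj(x)
        return F.interpolate(x, scale_factor=self.scale,
                             mode="bilinear", align_corners=False)


class FusionDecoder(nn.Module):
    # Progressively merge the multi-resolution maps, coarse to fine,
    # into a single full-resolution prediction (e.g. a depth map).
    def __init__(self, dim, num_maps, out_channels=1):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(num_maps - 1)
        )
        self.head = nn.Conv2d(dim, out_channels, 3, padding=1)

    def forward(self, maps):
        x = maps[-1]                                     # start from the coarsest map
        for block, skip in zip(self.blocks, reversed(maps[:-1])):
            x = F.interpolate(block(x), size=skip.shape[-2:],
                              mode="bilinear", align_corners=False) + skip
        return self.head(x)


if __name__ == "__main__":
    # Fake tokens from four transformer stages of a ViT with a 24x24 patch grid.
    tokens = [torch.randn(2, 24 * 24, 768) for _ in range(4)]
    reassemble = [Reassemble(768, 256, s) for s in (4.0, 2.0, 1.0, 0.5)]
    maps = [r(t, (24, 24)) for r, t in zip(reassemble, tokens)]
    print(FusionDecoder(256, num_maps=4)(maps).shape)    # torch.Size([2, 1, 96, 96])
```

The released models handle details differently (for example, readout-token handling and the exact fusion units); see the repository linked above for the actual implementation.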
Related papers
- DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention [1.5624421399300303]
We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs).
Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations.
These representations are then adapted for transformer input through an innovative patch tokenization.
arXiv Detail & Related papers (2024-07-18T22:15:35Z)
- B-cos Alignment for Inherently Interpretable CNNs and Vision Transformers [97.75725574963197]
We present a new direction for increasing the interpretability of deep neural networks (DNNs) by promoting weight-input alignment during training.
We show that a sequence of such transformations induces a single linear transformation that faithfully summarises the full model computations.
We show that the resulting explanations are of high visual quality and perform well under quantitative interpretability metrics.
arXiv Detail & Related papers (2023-06-19T12:54:28Z)
- CompletionFormer: Depth Completion with Convolutions and Vision Transformers [0.0]
This paper proposes a Joint Convolutional Attention and Transformer block (JCAT), which deeply couples the convolutional attention layer and Vision Transformer into one block, as the basic unit to construct our depth completion model in a pyramidal structure.
Our CompletionFormer outperforms state-of-the-art CNN-based methods on the outdoor KITTI Depth Completion benchmark and the indoor NYUv2 dataset, while being significantly more efficient (nearly 1/3 of the FLOPs) than pure Transformer-based methods.
arXiv Detail & Related papers (2023-04-25T17:59:47Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers, built on self-attention, have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- Semi-Supervised Vision Transformers [76.83020291497895]
We study the training of Vision Transformers for semi-supervised image classification.
We find that Vision Transformers perform poorly in a semi-supervised ImageNet setting.
CNNs achieve superior results in the small labeled-data regime.
arXiv Detail & Related papers (2021-11-22T09:28:13Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
The Convolutional vision Transformer (CvT) improves on the Vision Transformer (ViT) in both performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.