XCiT: Cross-Covariance Image Transformers
- URL: http://arxiv.org/abs/2106.09681v2
- Date: Fri, 18 Jun 2021 15:33:31 GMT
- Title: XCiT: Cross-Covariance Image Transformers
- Authors: Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski,
Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel
Synnaeve, Jakob Verbeek, Hervé Jégou
- Abstract summary: We propose a "transposed" version of self-attention that operates across feature channels rather than tokens.
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images.
- Score: 73.33400159139708
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Following their success in natural language processing, transformers have
recently shown much promise for computer vision. The self-attention operation
underlying transformers yields global interactions between all tokens, i.e.
words or image patches, and enables flexible modelling of image data beyond the
local interactions of convolutions. This flexibility, however, comes with a
quadratic complexity in time and memory, hindering application to long
sequences and high-resolution images. We propose a "transposed" version of
self-attention that operates across feature channels rather than tokens, where
the interactions are based on the cross-covariance matrix between keys and
queries. The resulting cross-covariance attention (XCA) has linear complexity
in the number of tokens, and allows efficient processing of high-resolution
images. Our cross-covariance image transformer (XCiT) is built upon XCA. It
combines the accuracy of conventional transformers with the scalability of
convolutional architectures. We validate the effectiveness and generality of
XCiT by reporting excellent results on multiple vision benchmarks, including
image classification and self-supervised feature learning on ImageNet-1k,
object detection and instance segmentation on COCO, and semantic segmentation
on ADE20k.
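To make the XCA operation concrete, the sketch below shows how a small per-head channel attention map can replace the usual N x N token attention, following the description in the abstract (L2-normalized queries and keys, a learnable per-head temperature). This is a minimal PyTorch reconstruction for illustration only; the module and variable names are my own and this is not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class XCA(nn.Module):
    """Cross-covariance attention sketch: attention acts over feature channels,
    so cost grows linearly with the number of tokens N."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # learnable per-head temperature, as described in the paper
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))

    def forward(self, x):                                   # x: (B, N, dim)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 4, 1).unbind(0)      # each: (B, heads, d_head, N)
        # L2-normalize along the token axis so the d_head x d_head
        # cross-covariance attention map stays well conditioned
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, heads, d_head, d_head)
        attn = attn.softmax(dim=-1)
        out = attn @ v                                        # (B, heads, d_head, N)
        out = out.permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)

# toy usage with assumed sizes
y = XCA(dim=192, num_heads=4)(torch.randn(2, 196, 192))      # (2, 196, 192)
```

Because the attention map has shape (d_head, d_head) rather than (N, N), memory and compute scale linearly with the number of tokens, which is the source of the efficiency on high-resolution images.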
Related papers
- Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers [13.480259378415505]
BiXT scales linearly with input size in terms of computational cost and memory consumption.
BiXT is inspired by the Perceiver architectures but replaces iterative attention with an efficient bi-directional cross-attention module.
By combining efficiency with the generality and performance of a full Transformer architecture, BiXT can process longer sequences.
arXiv Detail & Related papers (2024-02-19T13:38:15Z)
- Learning A Sparse Transformer Network for Effective Image Deraining [42.01684644627124]
We propose an effective DeRaining network, Sparse Transformer (DRSformer).
We develop a learnable top-k selection operator to adaptively retain the most crucial attention scores from the keys for each query, for better feature aggregation; a generic sketch of this top-k selection follows below.
We equip our model with a mixture-of-experts feature compensator to form a cooperative refinement deraining scheme.
arXiv Detail & Related papers (2023-03-21T15:41:57Z)
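The learnable top-k selection described in the DRSformer entry above can be illustrated generically: keep only the k largest query-key scores per query and mask out the rest before the softmax. The sketch below is an illustrative reconstruction of that idea, not the authors' implementation; the function name, shapes, and fixed value of k are assumptions.

```python
import torch

def topk_attention(q, k, v, keep: int):
    """Generic top-k sparse attention sketch (not the DRSformer code):
    for each query, retain only the `keep` largest query-key scores and
    suppress the rest before the softmax."""
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)  # (..., Nq, Nk)
    kth_largest = scores.topk(keep, dim=-1).values[..., -1:]   # per-query threshold
    scores = scores.masked_fill(scores < kth_largest, float("-inf"))
    return scores.softmax(dim=-1) @ v

# toy usage with assumed shapes: 64 tokens, 32 channels, keep the top 8 keys
q = k = v = torch.randn(1, 64, 32)
out = topk_attention(q, k, v, keep=8)                           # (1, 64, 32)
```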
- DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation in the optimal number of tokens each position should focus on.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
- MAT: Mask-Aware Transformer for Large Hole Image Inpainting [79.67039090195527]
We present a novel model for large hole inpainting, which unifies the merits of transformers and convolutions.
Experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets.
arXiv Detail & Related papers (2022-03-29T06:36:17Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention implementation together with the softmax weighting, keeping only the query-key similarity; an illustrative sketch follows below.
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
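A rough way to picture the simplified decoder in the image-matching entry above: instead of a full softmax attention, only the query-key similarities between the two images' tokens are kept and pooled into a matching score. The sketch below is an assumption-laden illustration under that reading, not the authors' decoder; the function name, pooling choice, and sizes are my own.

```python
import torch

def match_score(tokens_a, tokens_b):
    """Softmax-free matching sketch: keep raw query-key similarities between the
    token sets of two images, take the best match per query, then average."""
    sim = tokens_a @ tokens_b.transpose(-2, -1) / tokens_a.shape[-1] ** 0.5  # (Na, Nb)
    return sim.max(dim=-1).values.mean()

# toy usage: two images, each with 196 tokens of 128 channels (assumed sizes)
score = match_score(torch.randn(196, 128), torch.randn(196, 128))
```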
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear computational and memory complexity, rather than the quadratic cost of standard attention; a minimal sketch follows below.
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
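The linear complexity claimed in the CrossViT entry above is consistent with using a single class token from one branch as the only query over the other branch's patch tokens, so the attention map has one row instead of N. The sketch below illustrates that reading only; the module name, head count, and sizes are assumptions, and this is not the authors' code.

```python
import torch
import torch.nn as nn

class ClsCrossAttention(nn.Module):
    """Cross-attention sketch: one branch's class token (a single query) attends
    to the other branch's patch tokens, so cost grows linearly with token count."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, cls_token, other_tokens):
        # cls_token: (B, 1, dim); other_tokens: (B, N, dim)
        fused, _ = self.attn(cls_token, other_tokens, other_tokens)
        return fused                                     # (B, 1, dim)

# toy usage with assumed sizes: fuse a small-patch class token with large-patch tokens
cls_small = torch.randn(2, 1, 256)
patches_large = torch.randn(2, 197, 256)
out = ClsCrossAttention(dim=256, num_heads=4)(cls_small, patches_large)  # (2, 1, 256)
```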
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.