BiFormer: Vision Transformer with Bi-Level Routing Attention
- URL: http://arxiv.org/abs/2303.08810v1
- Date: Wed, 15 Mar 2023 17:58:46 GMT
- Title: BiFormer: Vision Transformer with Bi-Level Routing Attention
- Authors: Lei Zhu, Xinjiang Wang, Zhanghan Ke, Wayne Zhang, and Rynson Lau
- Abstract summary: We propose a novel dynamic sparse attention via bi-level routing to enable a more flexible allocation of computations with content awareness.
Specifically, for a query, irrelevant key-value pairs are first filtered out at a coarse region level, and then fine-grained token-to-token attention is applied in the union of remaining candidate regions.
Built with the proposed bi-level routing attention, a new general vision transformer, named BiFormer, is then presented.
- Score: 26.374724782056557
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the core building block of vision transformers, attention is a powerful tool to capture long-range dependency. However, such power comes at a cost: it incurs a huge computation burden and heavy memory footprint, as pairwise token interaction across all spatial locations is computed. A series of works attempt to alleviate this problem by introducing handcrafted, content-agnostic sparsity into attention, such as restricting the attention operation to local windows, axial stripes, or dilated windows. In contrast to these approaches, we propose a novel dynamic sparse attention via bi-level routing to enable a more flexible allocation of computations with content awareness. Specifically, for a query, irrelevant key-value pairs are first filtered out at a coarse region level, and then fine-grained token-to-token attention is applied in the union of the remaining candidate regions (i.e., routed regions). We provide a simple yet effective implementation of the proposed bi-level routing attention, which utilizes the sparsity to save both computation and memory while involving only GPU-friendly dense matrix multiplications. Built with the proposed bi-level routing attention, a new general vision transformer, named BiFormer, is then presented. As BiFormer attends to a small subset of relevant tokens in a query-adaptive manner, without distraction from other irrelevant ones, it enjoys both good performance and high computational efficiency, especially in dense prediction tasks. Empirical results across several computer vision tasks, such as image classification, object detection, and semantic segmentation, verify the effectiveness of our design. Code is available at https://github.com/rayleizhu/BiFormer.
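The routing described above reduces to a handful of dense tensor operations, which is why it stays GPU-friendly. Below is a minimal single-head PyTorch sketch of the two-level scheme; the identity q/k/v projections, region count, and top-k value are illustrative simplifications of the abstract's description, not the official BiFormer code (see the repository above for that).

```python
import torch
import torch.nn.functional as F

def bi_level_routing_attention(x, n_regions=7, topk=4):
    """x: (B, H, W, C) feature map; H and W must be divisible by n_regions."""
    B, H, W, C = x.shape
    S = n_regions
    h, w = H // S, W // S                               # tokens per region side
    # group tokens by region: (B, S*S, h*w, C)
    xr = x.view(B, S, h, S, w, C).permute(0, 1, 3, 2, 4, 5).reshape(B, S * S, h * w, C)
    # BiFormer uses learned linear projections for q, k, v; identity here for brevity
    q, k, v = xr, xr, xr

    # level 1: coarse region-to-region routing via region-averaged queries/keys
    q_region = q.mean(dim=2)                            # (B, S*S, C)
    k_region = k.mean(dim=2)                            # (B, S*S, C)
    affinity = q_region @ k_region.transpose(-1, -2)    # (B, S*S, S*S)
    _, idx = affinity.topk(topk, dim=-1)                # top-k routed regions per region

    # level 2: fine-grained token-to-token attention inside the routed regions
    idx_exp = idx[..., None, None].expand(-1, -1, -1, h * w, C)
    k_all = k.unsqueeze(1).expand(-1, S * S, -1, -1, -1)
    v_all = v.unsqueeze(1).expand(-1, S * S, -1, -1, -1)
    k_sel = torch.gather(k_all, 2, idx_exp).reshape(B, S * S, topk * h * w, C)
    v_sel = torch.gather(v_all, 2, idx_exp).reshape(B, S * S, topk * h * w, C)

    attn = F.softmax(q @ k_sel.transpose(-1, -2) / C ** 0.5, dim=-1)
    out = attn @ v_sel                                  # (B, S*S, h*w, C)
    # scatter regions back to the (B, H, W, C) layout
    return out.view(B, S, S, h, w, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
```

For instance, `bi_level_routing_attention(torch.randn(2, 28, 28, 64))` has each of the 49 regions attend to tokens from only its 4 highest-affinity regions.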
Related papers
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters [10.403248386029407]
Self-attention is a significant computational bottleneck due to its quadratic complexity in the sequence length.
In this work, we derive the scalar energy function whose gradient computes the self-attention block.
Our formulation reveals that the reduction across the sequence axis can be efficiently computed in parallel through a tree reduction.
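The sequence-axis reduction is associative, which is what allows a tree schedule: partial softmax statistics from chunks merge pairwise the same way flash-style attention merges blocks. A hedged single-query PyTorch toy follows; the chunking and function names are mine, not the paper's multi-GPU decoding implementation.

```python
import math
import torch

def chunk_stats(q, k, v):
    """Partial attention statistics for one chunk: (max score, exp-sum, weighted values)."""
    s = k @ q / math.sqrt(q.numel())       # (n,) scores of this chunk's keys
    m = s.max()
    w = torch.exp(s - m)
    return m, w.sum(), w @ v               # scalar, scalar, (d,)

def combine(a, b):
    """Associative merge of two partial results (the usual log-sum-exp algebra)."""
    m = torch.maximum(a[0], b[0])
    sa, sb = torch.exp(a[0] - m), torch.exp(b[0] - m)
    return m, sa * a[1] + sb * b[1], sa * a[2] + sb * b[2]

def tree_attention(q, k, v, n_chunks=8):
    """Exact softmax attention for one query q: (d,) against k, v: (n, d)."""
    parts = [chunk_stats(q, kc, vc)
             for kc, vc in zip(k.chunk(n_chunks), v.chunk(n_chunks))]
    while len(parts) > 1:                  # merge pairwise: one tree level per pass
        parts = [combine(parts[i], parts[i + 1]) if i + 1 < len(parts) else parts[i]
                 for i in range(0, len(parts), 2)]
    _, denom, num = parts[0]
    return num / denom                     # equals softmax(k @ q / sqrt(d)) @ v
```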
arXiv Detail & Related papers (2024-08-07T21:16:55Z)
- UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks and efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is a novel efficient paired attention (EPA) block that learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BraTS, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
- EcoFormer: Energy-Saving Attention with Linear Complexity [40.002608785252164]
The Transformer is a transformative framework for modeling sequential data, but it comes with high computational and energy cost.
We propose a new binarization paradigm customized to high-dimensional softmax attention.
We show that EcoFormer consistently achieves performance comparable to standard attention.
arXiv Detail & Related papers (2022-09-19T13:28:32Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We adopt transformers in this work and incorporate them into a hierarchical framework for shape classification as well as part and scene segmentation.
We also compute efficient and dynamic global cross attention by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art mean accuracy on shape classification and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- Rethinking Query-Key Pairwise Interactions in Vision Transformers [5.141895475956681]
We propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency-gate to obtain attention weights.
We develop a new family of self-attention models, LinGlos, which reach state-of-the-art accuracy in the parameter-limited setting of the ImageNet classification benchmark.
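As a rough illustration of what dropping query-key interactions can look like, here is a hedged PyTorch sketch in which a learned saliency gate scores each key and a single global context vector is formed; the layer names and residual wiring are assumptions, not the LinGlo architecture.

```python
import torch.nn as nn

class KeyOnlyAttention(nn.Module):
    """Attention weights come from the keys alone: cost is linear in token count."""
    def __init__(self, dim):
        super().__init__()
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, 1)           # saliency score per token (assumed gate)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, C)
        k, v = self.to_k(x), self.to_v(x)
        w = self.gate(k).softmax(dim=1)         # (B, N, 1): weights over tokens
        ctx = (w * v).sum(dim=1, keepdim=True)  # (B, 1, C): global context vector
        return self.proj(x + ctx)               # broadcast context to every token
```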
arXiv Detail & Related papers (2022-07-01T03:36:49Z)
- PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [114.8051035856023]
We propose PSViT: a ViT with token Pooling and attention Sharing to reduce redundancy.
Experimental results show that the proposed scheme can achieve up to 6.6% accuracy improvement in ImageNet classification.
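The token-pooling half of the idea can be illustrated generically: merging each 2x2 neighborhood of tokens between stages quarters the token count, and with it the quadratic attention cost of later layers. A hedged sketch of the idea, not PSViT's actual pooling schedule:

```python
import torch.nn.functional as F

def pool_tokens(x, hw):
    """x: (B, N, C) tokens on an hw[0] x hw[1] grid (both sides assumed even)."""
    B, N, C = x.shape
    H, W = hw
    grid = x.transpose(1, 2).reshape(B, C, H, W)   # back to a 2D token grid
    grid = F.avg_pool2d(grid, kernel_size=2)       # merge each 2x2 token patch
    return grid.flatten(2).transpose(1, 2), (H // 2, W // 2)
```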
arXiv Detail & Related papers (2021-08-07T11:30:54Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
- Coordinate Attention for Efficient Mobile Network Design [96.40415345942186]
We propose a novel attention mechanism for mobile networks by embedding positional information into channel attention.
Unlike channel attention that transforms a feature tensor to a single feature vector via 2D global pooling, the coordinate attention factorizes channel attention into two 1D feature encoding processes.
Our coordinate attention is beneficial to ImageNet classification and performs better in downstream tasks, such as object detection and semantic segmentation.
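The factorization is concrete enough to sketch: pool along each spatial axis separately, encode the two pooled maps jointly, then split them back into a pair of direction-aware attention maps. The PyTorch below is a simplified reading of that description; the reduction ratio, normalization, and activation choices are assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.encode = nn.Sequential(             # shared 1x1 encoding of both axes
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.attn_h = nn.Conv2d(mid, channels, 1)
        self.attn_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):                        # x: (B, C, H, W)
        _, _, H, W = x.shape
        ph = x.mean(dim=3, keepdim=True)         # (B, C, H, 1): pool along width
        pw = x.mean(dim=2, keepdim=True)         # (B, C, 1, W): pool along height
        y = self.encode(torch.cat([ph, pw.transpose(2, 3)], dim=2))  # (B, mid, H+W, 1)
        yh, yw = y.split([H, W], dim=2)
        ah = self.attn_h(yh).sigmoid()                   # (B, C, H, 1)
        aw = self.attn_w(yw.transpose(2, 3)).sigmoid()   # (B, C, 1, W)
        return x * ah * aw                       # position-aware channel reweighting
```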
arXiv Detail & Related papers (2021-03-04T09:18:02Z)
- Towards Efficient Scene Understanding via Squeeze Reasoning [71.1139549949694]
We propose a novel framework called Squeeze Reasoning.
Instead of propagating information on the spatial map, we first learn to squeeze the input feature into a channel-wise global vector.
We show that our approach can be modularized as an end-to-end trained block and can be easily plugged into existing networks.
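A minimal sketch of the squeeze-then-modulate pattern, assuming global average pooling as the squeeze and a small MLP standing in for the paper's reasoning step (the authors' block is more elaborate):

```python
import torch.nn as nn

class SqueezeReasoning(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.reason = nn.Sequential(             # operates on (B, C) vectors only
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                        # x: (B, C, H, W)
        v = x.mean(dim=(2, 3))                   # squeeze to a channel-wise vector
        return x + x * self.reason(v)[:, :, None, None]   # modulate, keep residual
```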
arXiv Detail & Related papers (2020-11-06T12:17:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.