Dual-Flattening Transformers through Decomposed Row and Column Queries
for Semantic Segmentation
- URL: http://arxiv.org/abs/2201.09139v1
- Date: Sat, 22 Jan 2022 22:38:15 GMT
- Title: Dual-Flattening Transformers through Decomposed Row and Column Queries
for Semantic Segmentation
- Authors: Ying Wang, Chiuman Ho, Wenju Xu, Ziwei Xuan, Xudong Liu and Guo-Jun Qi
- Abstract summary: We propose a Dual-Flattening Transformer (DFlatFormer) to enable high-resolution output.
Experiments on ADE20K and Cityscapes datasets demonstrate the superiority of the proposed dual-flattening transformer architecture.
- Score: 50.321277476317974
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: It is critical to obtain high resolution features with long range dependency
for dense prediction tasks such as semantic segmentation. To generate
high-resolution output of size $H\times W$ from a low-resolution feature map of
size $h\times w$ ($hw\ll HW$), a naive dense transformer incurs an intractable
complexity of $\mathcal{O}(hwHW)$, limiting its application on high-resolution
dense prediction. We propose a Dual-Flattening Transformer (DFlatFormer) to
enable high-resolution output by reducing the complexity to $\mathcal{O}(hw(H+W))$,
which is multiple orders of magnitude smaller than that of the naive dense transformer.
Decomposed queries are presented to retrieve row and column attentions
tractably through separate transformers, and their outputs are combined to form
a dense feature map at high resolution. To this end, the input sequence fed
from an encoder is row-wise and column-wise flattened to align with decomposed
queries by preserving their row and column structures, respectively. Row and
column transformers also interact with each other to capture their mutual
attentions at the spatial crossings between rows and columns. We also propose
to perform attention through efficient grouping and pooling to further reduce
the model complexity. Extensive experiments on the ADE20K and Cityscapes datasets
demonstrate the superiority of the proposed dual-flattening transformer
architecture, which achieves higher mIoU.
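To make the complexity reduction concrete, below is a minimal NumPy sketch of the decomposed-query idea: $H$ row queries and $W$ column queries each attend over the $hw$ encoder tokens, and their outputs are broadcast and combined into an $H\times W$ feature map. The random query initialization, the plain softmax attention, and the additive fusion are illustrative assumptions only; the paper's row and column transformers additionally preserve row/column structure in the flattened inputs, interact with each other, and use grouping and pooling, none of which is modeled here.

```python
# Minimal sketch (assumptions noted above): decomposed row/column queries
# reduce the attention-score count from hw*HW to hw*(H+W).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # Standard scaled dot-product attention: (n_q, d) x (n_kv, d) -> (n_q, d).
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values

def dual_flatten_upsample(feat, H, W):
    """feat: low-resolution encoder feature map of shape (h, w, d)."""
    h, w, d = feat.shape
    tokens = feat.reshape(h * w, d)           # flattened encoder sequence

    # Hypothetical learned queries: one per output row and one per output column.
    row_q = np.random.randn(H, d)
    col_q = np.random.randn(W, d)

    # Each decomposed query set attends over the hw tokens separately,
    # costing hw*(H + W) scores instead of hw*H*W.
    row_out = attend(row_q, tokens, tokens)   # (H, d)
    col_out = attend(col_q, tokens, tokens)   # (W, d)

    # Fuse row and column responses into a dense H x W map (additive fusion
    # is a stand-in for the paper's row-column interaction).
    return row_out[:, None, :] + col_out[None, :, :]   # (H, W, d)

# For the abstract's setting, e.g. h = w = 64 and H = W = 512:
#   naive dense scores:  hw * HW    = 4096 * 262144 ~ 1.1e9
#   decomposed scores:   hw * (H+W) = 4096 * 1024   ~ 4.2e6
out = dual_flatten_upsample(np.random.randn(16, 16, 64), H=128, W=128)
print(out.shape)  # (128, 128, 64)
```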
Related papers
- Separations in the Representational Capabilities of Transformers and Recurrent Architectures [27.783705012503237]
We analyze the differences in the representational capabilities of Transformers and RNNs across several tasks of practical relevance.
We show that a one-layer Transformer of logarithmic width can perform index lookup, whereas an RNN requires a hidden state of linear size.
We also show that a log-size two-layer Transformer can implement the nearest neighbor algorithm in its forward pass.
arXiv Detail & Related papers (2024-06-13T17:31:30Z)
- RegFormer: An Efficient Projection-Aware Transformer Network for Large-Scale Point Cloud Registration [73.69415797389195]
We propose an end-to-end transformer network (RegFormer) for large-scale point cloud alignment.
Specifically, a projection-aware hierarchical transformer is proposed to capture long-range dependencies and filter outliers.
Our transformer has linear complexity, which guarantees high efficiency even for large-scale scenes.
arXiv Detail & Related papers (2023-03-22T08:47:37Z)
- DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention [53.02648818164273]
We present an efficient yet effective attention mechanism, namely the Dynamic Bilinear Low-Rank Attention (DBA).
DBA compresses the sequence length by input-sensitive dynamic projection matrices and achieves linear time and space complexity.
Experiments over tasks with diverse sequence length conditions show that DBA achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-11-24T03:06:36Z)
- Time-rEversed diffusioN tEnsor Transformer: A new TENET of Few-Shot Object Detection [35.54153749138406]
We propose a Time-rEversed diffusioN tEnsor Transformer (TENET) that captures multi-way feature occurrences that are highly discriminative.
We also propose a Transformer Relation Head (TRH) equipped with higher-order representations, which encodes correlations between query regions and the entire support set.
Our model achieves state-of-the-art results on PASCAL VOC, FSOD, and COCO.
arXiv Detail & Related papers (2022-10-30T17:40:12Z)
- Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences [16.066338004414092]
Diffuser is a new efficient Transformer for sequence-to-sequence modeling.
It incorporates all token interactions within one attention layer while maintaining low computation and memory costs.
We show its ability to approximate full-attention by analyzing the graph expander property from the spectral perspective.
arXiv Detail & Related papers (2022-10-21T08:13:34Z)
- Sketching as a Tool for Understanding and Accelerating Self-attention for Long Sequences [52.6022911513076]
Transformer-based models are not efficient in processing long sequences due to the quadratic space and time complexity of the self-attention modules.
Existing approaches such as Linformer and Informer reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection, respectively (a minimal sketch of this sequence-compression idea appears after this list).
Based on the theoretical analysis, we propose Skeinformer to accelerate self-attention and further improve the accuracy of matrix approximation to self-attention.
arXiv Detail & Related papers (2021-12-10T06:58:05Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
- Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding [90.77031668988661]
Cluster-Former is a novel clustering-based sparse Transformer to perform attention across chunked sequences.
The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer.
Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
arXiv Detail & Related papers (2020-09-13T22:09:30Z)
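Several of the entries above (DBA, and the Linformer/Informer baselines discussed in the Skeinformer paper) share the idea of compressing the key/value sequence before computing attention. The following is a minimal NumPy sketch of that generic sequence-compression trick, assuming a fixed random projection; the cited methods instead learn the projection, make it input-dependent, or derive it via sketching, so this only illustrates the drop from n x n to n x k attention scores.

```python
# Minimal sketch (fixed random projection is an assumption, not any paper's
# exact method): keys and values of length n are projected down to k << n
# tokens before attention.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def projected_attention(q, k, v, proj_dim=64):
    n, d = k.shape
    P = np.random.randn(proj_dim, n) / np.sqrt(n)   # (proj_dim, n) projection
    k_low, v_low = P @ k, P @ v                     # compress sequence length to proj_dim
    scores = q @ k_low.T / np.sqrt(d)               # (n, proj_dim) instead of (n, n)
    return softmax(scores) @ v_low

x = np.random.randn(4096, 256)
print(projected_attention(x, x, x).shape)  # (4096, 256)
```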
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.