Dual-Flattening Transformers through Decomposed Row and Column Queries
for Semantic Segmentation
- URL: http://arxiv.org/abs/2201.09139v1
- Date: Sat, 22 Jan 2022 22:38:15 GMT
- Title: Dual-Flattening Transformers through Decomposed Row and Column Queries
for Semantic Segmentation
- Authors: Ying Wang, Chiuman Ho, Wenju Xu, Ziwei Xuan, Xudong Liu and Guo-Jun Qi
- Abstract summary: We propose a Dual-Flattening Transformer (DFlatFormer) to enable high-resolution output.
Experiments on ADE20K and Cityscapes datasets demonstrate the superiority of the proposed dual-flattening transformer architecture.
- Score: 50.321277476317974
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: It is critical to obtain high resolution features with long range dependency
for dense prediction tasks such as semantic segmentation. To generate
high-resolution output of size $H\times W$ from a low-resolution feature map of
size $h\times w$ ($hw\ll HW$), a naive dense transformer incurs an intractable
complexity of $\mathcal{O}(hwHW)$, limiting its application on high-resolution
dense prediction. We propose a Dual-Flattening Transformer (DFlatFormer) to
enable high-resolution output by reducing the complexity to $\mathcal{O}(hw(H+W))$,
which is multiple orders of magnitude smaller than that of the naive dense transformer.
Decomposed queries are presented to retrieve row and column attentions
tractably through separate transformers, and their outputs are combined to form
a dense feature map at high resolution. To this end, the input sequence fed
from an encoder is row-wise and column-wise flattened to align with decomposed
queries by preserving their row and column structures, respectively. Row and
column transformers also interact with each other to capture their mutual
attentions at the spatial crossings between rows and columns. We also propose
to perform attention through efficient grouping and pooling to further reduce
the model complexity. Extensive experiments on the ADE20K and Cityscapes datasets
demonstrate the superiority of the proposed dual-flattening transformer
architecture, which achieves higher mIoU.
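To make the complexity reduction concrete, below is a minimal NumPy sketch of the decomposed-query idea: $H$ row queries and $W$ column queries each attend over the $hw$ encoder tokens, and their outputs are broadcast and combined into an $H\times W$ feature map. The random query initialization, the plain softmax attention, and the additive fusion are illustrative assumptions only; the paper's row and column transformers additionally preserve row/column structure in the flattened inputs, interact with each other, and use grouping and pooling, none of which is modeled here.

```python
# Minimal sketch (assumptions noted above): decomposed row/column queries
# reduce the attention-score count from hw*HW to hw*(H+W).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # Standard scaled dot-product attention: (n_q, d) x (n_kv, d) -> (n_q, d).
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values

def dual_flatten_upsample(feat, H, W):
    """feat: low-resolution encoder feature map of shape (h, w, d)."""
    h, w, d = feat.shape
    tokens = feat.reshape(h * w, d)           # flattened encoder sequence

    # Hypothetical learned queries: one per output row and one per output column.
    row_q = np.random.randn(H, d)
    col_q = np.random.randn(W, d)

    # Each decomposed query set attends over the hw tokens separately,
    # costing hw*(H + W) scores instead of hw*H*W.
    row_out = attend(row_q, tokens, tokens)   # (H, d)
    col_out = attend(col_q, tokens, tokens)   # (W, d)

    # Fuse row and column responses into a dense H x W map (additive fusion
    # is a stand-in for the paper's row-column interaction).
    return row_out[:, None, :] + col_out[None, :, :]   # (H, W, d)

# For the abstract's setting, e.g. h = w = 64 and H = W = 512:
#   naive dense scores:  hw * HW    = 4096 * 262144 ~ 1.1e9
#   decomposed scores:   hw * (H+W) = 4096 * 1024   ~ 4.2e6
out = dual_flatten_upsample(np.random.randn(16, 16, 64), H=128, W=128)
print(out.shape)  # (128, 128, 64)
```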
Related papers
- Separations in the Representational Capabilities of Transformers and Recurrent Architectures [27.783705012503237]
We analyze the differences in the representational capabilities of Transformers and RNNs across several tasks of practical relevance.
We show that a one-layer Transformer of logarithmic width can perform index lookup, whereas an RNN requires a hidden state of linear size.
We also show that a log-size two-layer Transformer can implement the nearest neighbor algorithm in its forward pass.
arXiv Detail & Related papers (2024-06-13T17:31:30Z)
- RegFormer: An Efficient Projection-Aware Transformer Network for Large-Scale Point Cloud Registration [73.69415797389195]
We propose an end-to-end transformer network (RegFormer) for large-scale point cloud alignment.
Specifically, a projection-aware hierarchical transformer is proposed to capture long-range dependencies and filter outliers.
Our transformer has linear complexity, which guarantees high efficiency even for large-scale scenes.
arXiv Detail & Related papers (2023-03-22T08:47:37Z)
- DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention [53.02648818164273]
We present an efficient yet effective attention mechanism, namely the Dynamic Bilinear Low-Rank Attention (DBA).
DBA compresses the sequence length by input-sensitive dynamic projection matrices and achieves linear time and space complexity.
Experiments over tasks with diverse sequence length conditions show that DBA achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-11-24T03:06:36Z)
- Time-rEversed diffusioN tEnsor Transformer: A new TENET of Few-Shot Object Detection [35.54153749138406]
We propose a Time-rEversed diffusioN tEnsor Transformer (TENET) that captures multi-way feature occurrences that are highly discriminative.
We also propose a Transformer Relation Head (TRH) equipped with higher-order representations, which encodes correlations between query regions and the entire support set.
Our model achieves state-of-the-art results on PASCAL VOC, FSOD, and COCO.
arXiv Detail & Related papers (2022-10-30T17:40:12Z)
- Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences [16.066338004414092]
Diffuser is a new efficient Transformer for sequence-to-sequence modeling.
It incorporates all token interactions within one attention layer while maintaining low computation and memory costs.
We show its ability to approximate full-attention by analyzing the graph expander property from the spectral perspective.
arXiv Detail & Related papers (2022-10-21T08:13:34Z)
- Sketching as a Tool for Understanding and Accelerating Self-attention for Long Sequences [52.6022911513076]
Transformer-based models are not efficient in processing long sequences due to the quadratic space and time complexity of the self-attention modules.
Existing approaches such as Linformer and Informer reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection, respectively (a minimal sketch of this sequence-compression idea appears after this list).
Based on the theoretical analysis, we propose Skeinformer to accelerate self-attention and further improve the accuracy of matrix approximation to self-attention.
arXiv Detail & Related papers (2021-12-10T06:58:05Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
- Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding [90.77031668988661]
Cluster-Former is a novel clustering-based sparse Transformer to perform attention across chunked sequences.
The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer.
Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
arXiv Detail & Related papers (2020-09-13T22:09:30Z)
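Several of the entries above (DBA, and the Linformer/Informer baselines discussed in the Skeinformer paper) share the idea of compressing the key/value sequence before computing attention. The following is a minimal NumPy sketch of that generic sequence-compression trick, assuming a fixed random projection; the cited methods instead learn the projection, make it input-dependent, or derive it via sketching, so this only illustrates the drop from n x n to n x k attention scores.

```python
# Minimal sketch (fixed random projection is an assumption, not any paper's
# exact method): keys and values of length n are projected down to k << n
# tokens before attention.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def projected_attention(q, k, v, proj_dim=64):
    n, d = k.shape
    P = np.random.randn(proj_dim, n) / np.sqrt(n)   # (proj_dim, n) projection
    k_low, v_low = P @ k, P @ v                     # compress sequence length to proj_dim
    scores = q @ k_low.T / np.sqrt(d)               # (n, proj_dim) instead of (n, n)
    return softmax(scores) @ v_low

x = np.random.randn(4096, 256)
print(projected_attention(x, x, x).shape)  # (4096, 256)
```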
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.