Contextual Transformer Networks for Visual Recognition
- URL: http://arxiv.org/abs/2107.12292v1
- Date: Mon, 26 Jul 2021 16:00:21 GMT
- Title: Contextual Transformer Networks for Visual Recognition
- Authors: Yehao Li and Ting Yao and Yingwei Pan and Tao Mei
- Abstract summary: We design a novel Transformer-style module, i.e., Contextual Transformer (CoT) block, for visual recognition.
Such design fully capitalizes on the contextual information among input keys to guide the learning of the dynamic attention matrix.
Our CoT block is appealing in that it can readily replace each $3\times3$ convolution in ResNet architectures.
- Score: 103.79062359677452
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer with self-attention has revolutionized the field of
natural language processing and has recently inspired Transformer-style
architecture designs that achieve competitive results in numerous computer
vision tasks. Nevertheless, most existing designs directly employ
self-attention over a 2D feature map to obtain the attention matrix based on
pairs of isolated queries and keys at each spatial location, leaving the rich
contexts among neighboring keys under-exploited. In this work, we design a
novel Transformer-style module, i.e., the Contextual Transformer (CoT) block,
for visual recognition. This design fully capitalizes on the contextual
information among input keys to guide the learning of the dynamic attention
matrix and thus strengthens the capacity of visual representation.
Technically, the CoT block first contextually encodes the input keys via a
$3\times3$ convolution, leading to a static contextual representation of the
inputs. We further concatenate the encoded keys with the input queries to
learn the dynamic multi-head attention matrix through two consecutive
$1\times1$ convolutions. The learnt attention matrix is then multiplied by the
input values to obtain the dynamic contextual representation of the inputs.
The fusion of the static and dynamic contextual representations is finally
taken as the output. Our CoT block is appealing in that it can readily replace
each $3\times3$ convolution in ResNet architectures, yielding a
Transformer-style backbone named Contextual Transformer Networks (CoTNet).
Through extensive experiments over a wide range of applications (e.g., image
recognition, object detection and instance segmentation), we validate the
superiority of CoTNet as a stronger backbone. Source code is available at
\url{https://github.com/JDAI-CV/CoTNet}.
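The abstract spells out the data flow of the CoT block, so a compact sketch may help make it concrete. The PyTorch module below is a simplified illustration, not the authors' implementation: the layer widths, reduction factor, BatchNorm/ReLU placement, and the final fusion (a plain sum) are assumptions, and the paper's local multi-head aggregation is reduced here to an element-wise gating of the values. The official code at https://github.com/JDAI-CV/CoTNet is the reference.

```python
# Simplified sketch of a CoT-style block (assumed hyperparameters, not the official code).
import torch
import torch.nn as nn


class CoTBlockSketch(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 3, reduction: int = 4):
        super().__init__()
        # Static context: encode input keys with a 3x3 convolution.
        self.key_embed = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        # Values: a 1x1 convolution.
        self.value_embed = nn.Sequential(
            nn.Conv2d(dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
        )
        # Dynamic attention: two consecutive 1x1 convolutions applied to the
        # concatenation of the contextualized keys and the queries.
        self.attention = nn.Sequential(
            nn.Conv2d(2 * dim, dim // reduction, 1, bias=False),
            nn.BatchNorm2d(dim // reduction),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); the queries are the input features themselves.
        k_static = self.key_embed(x)          # static contextual representation
        v = self.value_embed(x)               # values
        attn = self.attention(torch.cat([k_static, x], dim=1))
        # Simplification: apply the learnt attention as an element-wise gate on
        # the values instead of the paper's local multi-head aggregation.
        k_dynamic = attn.softmax(dim=1) * v   # dynamic contextual representation
        return k_static + k_dynamic           # fuse static and dynamic contexts


# Usage: the block keeps the channel count and spatial size, so it can stand in
# for a 3x3 convolution with equal input and output channels.
y = CoTBlockSketch(dim=64)(torch.randn(2, 64, 56, 56))
```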
Related papers
- Efficient Point Transformer with Dynamic Token Aggregating for Point Cloud Processing [19.73918716354272]
We propose an efficient point TransFormer with Dynamic Token Aggregating (DTA-Former) for point cloud representation and processing.
It achieves SOTA performance while being up to 30$\times$ faster than prior point Transformers on the ModelNet40, ShapeNet, and airborne MultiSpectral LiDAR (MS-LiDAR) datasets.
arXiv Detail & Related papers (2024-05-23T20:50:50Z) - High-Performance Transformers for Table Structure Recognition Need Early Convolutions [25.04573593082671]
Existing approaches use classic convolutional neural network (CNN) backbones for the visual encoder and transformers for the textual decoder.
We design a lightweight visual encoder for table structure recognition (TSR) without sacrificing expressive power.
We discover that a convolutional stem can match classic CNN backbone performance, with a much simpler model.
arXiv Detail & Related papers (2023-11-09T18:20:52Z) - Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts.
Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities.
We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
arXiv Detail & Related papers (2023-08-08T15:43:59Z) - Vision Transformer with Quadrangle Attention [76.35955924137986]
We propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation.
Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles.
We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which requires only minor code modifications and adds negligible extra computational cost.
arXiv Detail & Related papers (2023-03-27T11:13:50Z) - ParCNetV2: Oversized Kernel with Enhanced Attention [60.141606180434195]
We introduce a convolutional neural network architecture named ParCNetV2.
It extends position-aware circular convolution (ParCNet) with oversized convolutions and strengthens attention through bifurcate gate units.
Our method outperforms other pure convolutional neural networks as well as neural networks hybridizing CNNs and transformers.
arXiv Detail & Related papers (2022-11-14T07:22:55Z) - Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding [27.568879624013576]
The multimodal transformer exhibits high capacity and flexibility in aligning image and text for visual grounding.
Existing encoder-only grounding frameworks suffer from heavy computation due to the self-attention operation's quadratic time complexity.
We present Dynamic Multimodal DETR (Dynamic MDETR), which decouples the whole grounding process into encoding and decoding phases.
arXiv Detail & Related papers (2022-09-28T09:43:02Z) - Multimodal Token Fusion for Vision Transformers [54.81107795090239]
We propose a multimodal token fusion method (TokenFusion) for transformer-based vision tasks.
To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features.
The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact.
arXiv Detail & Related papers (2022-04-19T07:47:50Z) - TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder to decode the features to text, leading to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG)
arXiv Detail & Related papers (2021-11-16T09:10:39Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Incorporating Convolution Designs into Visual Transformers [24.562955955312187]
We propose a new Convolution-enhanced image Transformer (CeiT) which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data and extra CNN teachers.
arXiv Detail & Related papers (2021-03-22T13:16:12Z)