Bottleneck Transformers for Visual Recognition
- URL: http://arxiv.org/abs/2101.11605v1
- Date: Wed, 27 Jan 2021 18:55:27 GMT
- Title: Bottleneck Transformers for Visual Recognition
- Authors: Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter
Abbeel, Ashish Vaswani
- Abstract summary: We present BoTNet, a powerful backbone architecture that incorporates self-attention for vision tasks.
We present models that achieve a strong performance of 84.7% top-1 accuracy on the ImageNet benchmark.
We hope our simple and effective approach will serve as a strong baseline for future research in self-attention models for vision.
- Score: 97.16013761605254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present BoTNet, a conceptually simple yet powerful backbone architecture
that incorporates self-attention for multiple computer vision tasks including
image classification, object detection and instance segmentation. By just
replacing the spatial convolutions with global self-attention in the final
three bottleneck blocks of a ResNet and no other changes, our approach improves
upon the baselines significantly on instance segmentation and object detection
while also reducing the parameters, with minimal overhead in latency. Through
the design of BoTNet, we also point out how ResNet bottleneck blocks with
self-attention can be viewed as Transformer blocks. Without any bells and
whistles, BoTNet achieves 44.4% Mask AP and 49.7% Box AP on the COCO Instance
Segmentation benchmark using the Mask R-CNN framework; surpassing the previous
best published single model and single scale results of ResNeSt evaluated on
the COCO validation set. Finally, we present a simple adaptation of the BoTNet
design for image classification, resulting in models that achieve a strong
performance of 84.7% top-1 accuracy on the ImageNet benchmark while being up to
2.33x faster in compute time than the popular EfficientNet models on TPU-v3
hardware. We hope our simple and effective approach will serve as a strong
baseline for future research in self-attention models for vision.
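Since the change is a single block swap, it is easy to sketch. Below is a minimal PyTorch sketch of a BoT-style block, i.e. a ResNet bottleneck whose 3x3 spatial convolution is replaced by global multi-head self-attention; the class names are illustrative, and the learned absolute position embedding is a simplification of the 2D relative position encodings used in the paper.
```python
import torch
import torch.nn as nn

class MHSA2d(nn.Module):
    """Global multi-head self-attention over an H x W feature map.

    Simplification: a learned absolute position embedding stands in for
    the 2D relative position encodings used in the paper."""
    def __init__(self, dim, heads=4, height=14, width=14):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.qkv = nn.Conv2d(dim, 3 * dim, kernel_size=1, bias=False)
        self.pos = nn.Parameter(torch.zeros(1, dim, height, width))

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x + self.pos).chunk(3, dim=1)
        # reshape each of q, k, v to (B, heads, H*W, head_dim)
        shape = (b, self.heads, c // self.heads, h * w)
        q, k, v = (t.reshape(shape).transpose(-1, -2) for t in (q, k, v))
        attn = ((q @ k.transpose(-1, -2)) * self.scale).softmax(dim=-1)
        out = attn @ v                                    # (B, heads, H*W, head_dim)
        return out.transpose(-1, -2).reshape(b, c, h, w)

class BoTBlock(nn.Module):
    """ResNet bottleneck in which the 3x3 conv is replaced by MHSA."""
    def __init__(self, in_dim, bottleneck_dim, height=14, width=14):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_dim, bottleneck_dim, 1, bias=False),
            nn.BatchNorm2d(bottleneck_dim), nn.ReLU(inplace=True),
            MHSA2d(bottleneck_dim, height=height, width=width),  # was: 3x3 conv
            nn.BatchNorm2d(bottleneck_dim), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_dim, in_dim, 1, bias=False),
            nn.BatchNorm2d(in_dim),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.net(x))
```
For example, BoTBlock(2048, 512, height=14, width=14) acts on a (B, 2048, 14, 14) feature map as a drop-in replacement for a final-stage bottleneck block.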
Related papers
- Y-CA-Net: A Convolutional Attention Based Network for Volumetric Medical Image Segmentation [47.12719953712902]
Discriminative local features are key components for the performance of attention-based volumetric segmentation (VS) methods.
We combine a convolutional encoder branch with a transformer backbone to extract local and global features in parallel.
Y-CT-Net achieves competitive performance on multiple medical segmentation tasks.
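A rough sketch of this parallel local/global extraction is below; the module name, patch size, and fusion by concatenation are assumptions for illustration, not the paper's exact design.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelLocalGlobalEncoder(nn.Module):
    """Hypothetical module: a 3D convolutional branch (local features)
    runs in parallel with a transformer branch (global context) over the
    same volume, and the two outputs are fused by concatenation."""
    def __init__(self, in_ch=1, dim=64, heads=4, patch=4):
        super().__init__()
        self.patch = patch
        self.conv_branch = nn.Sequential(
            nn.Conv3d(in_ch, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.embed = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.fuse = nn.Conv3d(2 * dim, dim, 1)

    def forward(self, x):                                # x: (B, C, D, H, W)
        b, _, d, h, w = x.shape
        local = self.conv_branch(x)                      # (B, dim, D, H, W)
        tok = self.embed(x).flatten(2).transpose(1, 2)   # (B, N, dim) patch tokens
        glob = self.transformer(tok).transpose(1, 2)     # (B, dim, N)
        p = self.patch
        glob = glob.reshape(b, -1, d // p, h // p, w // p)
        glob = F.interpolate(glob, size=(d, h, w), mode="trilinear", align_corners=False)
        return self.fuse(torch.cat([local, glob], dim=1))
```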
arXiv Detail & Related papers (2024-10-01T18:50:45Z)
- Scale-Invariant Object Detection by Adaptive Convolution with Unified Global-Local Context [3.061662434597098]
We propose an object detection model using a Switchable (adaptive) Atrous Convolutional Network (SAC-Net) built on the EfficientDet model.
The proposed SAC-Net encapsulates the benefits of both low-level and high-level features to achieve improved performance on multi-scale object detection tasks.
Our experiments on benchmark datasets demonstrate that the proposed SAC-Net outperforms the state-of-the-art models by a significant margin in terms of accuracy.
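The switchable atrous convolution idea can be sketched as two convolutions sharing one set of weights but applied with different dilation rates, blended per pixel by a learned switch; this follows the general switchable-atrous-convolution recipe rather than the paper's exact formulation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchableAtrousConv(nn.Module):
    """Two 3x3 convolutions sharing one weight tensor but using different
    dilation rates; a spatial switch predicted from the input blends
    them, effectively picking a receptive field per pixel."""
    def __init__(self, ch):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(ch, ch, 3, 3))
        nn.init.kaiming_normal_(self.weight)
        self.switch = nn.Sequential(
            nn.AvgPool2d(5, stride=1, padding=2),
            nn.Conv2d(ch, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        s = self.switch(x)                                    # (B, 1, H, W) in [0, 1]
        small = F.conv2d(x, self.weight, padding=1, dilation=1)
        large = F.conv2d(x, self.weight, padding=3, dilation=3)
        return s * small + (1 - s) * large
```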
arXiv Detail & Related papers (2024-09-17T10:08:37Z)
- Dynamic Pre-training: Towards Efficient and Scalable All-in-One Image Restoration [100.54419875604721]
All-in-one image restoration tackles different types of degradations with a unified model instead of having task-specific, non-generic models for each degradation.
We propose DyNet, a dynamic family of networks designed in an encoder-decoder style for all-in-one image restoration tasks.
Our DyNet can seamlessly switch between its bulkier and lightweight variants, thereby offering flexibility for efficient model deployment.
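One way such switching can work is weight sharing across depths; the sketch below is a hypothetical illustration (names and mechanism are assumptions), where a single set of block weights is reused a variable number of times.
```python
import torch.nn as nn

class DynamicStage(nn.Module):
    """Hypothetical stage with one set of block weights reused a variable
    number of times, so a single checkpoint can run as a bulky variant
    (more weight-tied passes) or a lightweight one (fewer passes)."""
    def __init__(self, dim, max_repeats=4):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.max_repeats = max_repeats

    def forward(self, x, repeats=None):
        repeats = self.max_repeats if repeats is None else repeats
        for _ in range(repeats):
            x = x + self.block(x)   # every pass reuses the same weights
        return x
```
Calling stage(x, repeats=1) would give the lightweight variant and stage(x) the bulky one, without storing separate weights.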
arXiv Detail & Related papers (2024-04-02T17:58:49Z)
- Early Fusion of Features for Semantic Segmentation [10.362589129094975]
This paper introduces a novel segmentation framework that integrates a classifier network with a reverse HRNet architecture for efficient image segmentation.
Our methodology is rigorously tested across several benchmark datasets including Mapillary Vistas, Cityscapes, CamVid, COCO, and PASCAL-VOC2012.
The results demonstrate the effectiveness of our proposed model in achieving high segmentation accuracy, indicating its potential for various applications in image analysis.
arXiv Detail & Related papers (2024-02-08T22:58:06Z)
- SODAWideNet -- Salient Object Detection with an Attention augmented Wide Encoder Decoder network without ImageNet pre-training [3.66237529322911]
We explore developing a neural network from scratch, trained directly on Salient Object Detection without ImageNet pre-training.
We propose SODAWideNet, an encoder-decoder-style network for Salient Object Detection.
Two variants, SODAWideNet-S (3.03M) and SODAWideNet (9.03M), achieve competitive performance against state-of-the-art models on five datasets.
arXiv Detail & Related papers (2023-11-08T16:53:44Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
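A minimal sketch of this pattern: key/value tokens are softly assigned to a small number of clusters via a learned projection (a stand-in for the paper's clustering procedure), and queries attend to the aggregated cluster tokens, cutting the attention cost from O(N^2) to O(NM) for M clusters.
```python
import torch.nn as nn

class ClusteredAttention(nn.Module):
    """Content-based sparse attention: queries attend to M aggregated
    cluster tokens instead of all N key/value tokens."""
    def __init__(self, dim, num_clusters=16):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.assign = nn.Linear(dim, num_clusters)  # token -> cluster logits
        self.scale = dim ** -0.5

    def forward(self, x):                           # x: (B, N, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # soft membership: each cluster's weights over tokens sum to 1
        a = self.assign(x).softmax(dim=1)           # (B, N, M)
        k_c = a.transpose(1, 2) @ k                 # (B, M, dim) cluster keys
        v_c = a.transpose(1, 2) @ v                 # (B, M, dim) cluster values
        attn = ((q @ k_c.transpose(1, 2)) * self.scale).softmax(dim=-1)
        return attn @ v_c                           # (B, N, dim)
```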
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- Container: Context Aggregation Network [83.12004501984043]
A recent surprising finding shows that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods that do not scale well to downstream tasks relying on larger input image resolutions, our efficient network, named CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks.
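The aggregation view can be sketched as an affinity matrix applied to the values, where the affinity mixes a dynamic, input-dependent term (as in self-attention) with a static, learned term (as in convolution); the single scalar mixing weight below is a simplification.
```python
import torch
import torch.nn as nn

class ContextAggregation(nn.Module):
    """Unified context aggregation: output = affinity @ values, with the
    affinity blending a dynamic (attention-like) and a static
    (convolution-like) term. Assumes a fixed token count N."""
    def __init__(self, dim, num_tokens):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.static_affinity = nn.Parameter(torch.zeros(num_tokens, num_tokens))
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learned mixing weight
        self.scale = dim ** -0.5

    def forward(self, x):                             # x: (B, N, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        dyn = ((q @ k.transpose(1, 2)) * self.scale).softmax(dim=-1)
        aff = self.alpha * dyn + (1 - self.alpha) * self.static_affinity
        return aff @ v
```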
arXiv Detail & Related papers (2021-06-02T18:09:11Z)
- Local Grid Rendering Networks for 3D Object Detection in Point Clouds [98.02655863113154]
CNNs are powerful, but voxelizing an entire point cloud into a dense regular 3D grid and then applying convolutions directly is computationally costly.
We propose a novel and principled Local Grid Rendering (LGR) operation to render the small neighborhood of a subset of input points into a low-resolution 3D grid independently.
We validate LGR-Net for 3D object detection on the challenging ScanNet and SUN RGB-D datasets.
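A simplified sketch of the rendering step: each sampled center voxelizes only its local neighborhood into a small occupancy grid, rather than voxelizing the whole cloud; the function name and binary occupancy encoding are assumptions.
```python
import torch

def local_grid_render(points, centers, radius=0.5, resolution=8):
    """Render the neighborhood of each center into a small occupancy grid.

    points:  (N, 3) point cloud; centers: (M, 3) sampled anchor points.
    Returns a (M, 1, R, R, R) batch of occupancy grids."""
    M, R = centers.shape[0], resolution
    grids = torch.zeros(M, 1, R, R, R)
    for i, c in enumerate(centers):
        nb = points[(points - c).norm(dim=1) < radius]  # local neighborhood only
        if nb.numel() == 0:
            continue
        # map coordinates in [c - radius, c + radius] onto [0, R - 1] voxel indices
        idx = (((nb - c) / radius + 1) / 2 * (R - 1)).long().clamp(0, R - 1)
        grids[i, 0, idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grids
```
A small 3D CNN can then process each low-resolution grid independently, avoiding a dense grid over the full scene.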
arXiv Detail & Related papers (2020-07-04T13:57:43Z)
- ResNeSt: Split-Attention Networks [86.25490825631763]
We present a modularized architecture that applies channel-wise attention across different network branches to leverage their success in capturing cross-feature interactions and learning diverse representations.
Our model, named ResNeSt, outperforms EfficientNet in accuracy and latency trade-off on image classification.
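A minimal sketch of a split-attention unit: several parallel branches process the input, and channel-wise soft attention over the branches decides how their outputs are combined; layer sizes and the branch convolutions are illustrative.
```python
import torch
import torch.nn as nn

class SplitAttention(nn.Module):
    """radix parallel branches combined by channel-wise soft attention
    over the branches (a softmax across the radix dimension)."""
    def __init__(self, ch, radix=2, reduction=4):
        super().__init__()
        self.radix = radix
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=1, bias=False) for _ in range(radix)
        )
        inner = max(ch // reduction, 8)
        self.fc1 = nn.Conv2d(ch, inner, 1)
        self.fc2 = nn.Conv2d(inner, ch * radix, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]      # radix feature maps
        gap = sum(feats).mean(dim=(2, 3), keepdim=True)      # global pooled summary
        a = self.fc2(torch.relu(self.fc1(gap)))              # (B, ch*radix, 1, 1)
        b, c = x.shape[0], x.shape[1]
        a = a.view(b, self.radix, c, 1, 1).softmax(dim=1)    # attention over branches
        return sum(a[:, i] * feats[i] for i in range(self.radix))
```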
arXiv Detail & Related papers (2020-04-19T20:40:31Z)