Improved Multiscale Vision Transformers for Classification and Detection
- URL: http://arxiv.org/abs/2112.01526v1
- Date: Thu, 2 Dec 2021 18:59:57 GMT
- Title: Improved Multiscale Vision Transformers for Classification and Detection
- Authors: Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong,
Jitendra Malik, Christoph Feichtenhofer
- Abstract summary: We study Multiscale Vision Transformers (MViT) as a unified architecture for image and video classification, as well as object detection.
We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections.
We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition.
- Score: 80.64111139883694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study Multiscale Vision Transformers (MViT) as a unified
architecture for image and video classification, as well as object detection.
We present an improved version of MViT that incorporates decomposed relative
positional embeddings and residual pooling connections. We instantiate this
architecture in five sizes and evaluate it for ImageNet classification, COCO
detection and Kinetics video recognition where it outperforms prior work. We
further compare MViTs' pooling attention to window attention mechanisms where
it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViT
has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet
classification, 56.1 box AP on COCO object detection as well as 86.1% on
Kinetics-400 video classification. Code and models will be made publicly
available.
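Since the abstract names the two architectural changes without detail, the following is a minimal, single-head PyTorch sketch of pooling attention with a residual pooling connection and decomposed (height/width) relative positional embeddings. It is not the authors' released code: the average pooling (in place of a strided depthwise convolution), the shared pooling stride for Q, K and V, and all shapes below are simplifying assumptions.
```python
# Hedged sketch of MViTv2-style pooling attention (single head, simplified).
# Assumptions: average pooling instead of strided depthwise conv, and the same
# pooling stride for Q, K and V so all pooled tokens share one H' x W' grid.
import torch
import torch.nn as nn


class PoolingAttention(nn.Module):
    def __init__(self, dim, grid_size, stride=2):
        super().__init__()
        self.h, self.w = grid_size                              # input token grid
        self.ph, self.pw = self.h // stride, self.w // stride   # pooled grid
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.pool = nn.AvgPool2d(kernel_size=stride, stride=stride)
        # decomposed relative position tables: one along height, one along width
        self.rel_h = nn.Parameter(torch.zeros(2 * self.ph - 1, dim))
        self.rel_w = nn.Parameter(torch.zeros(2 * self.pw - 1, dim))

    def _pool_tokens(self, x):
        # (B, H*W, C) -> (B, H'*W', C) by pooling over the 2D grid
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, self.h, self.w)
        return self.pool(x).flatten(2).transpose(1, 2)

    def _decomposed_rel_bias(self, q):
        # bias[i, j] = q_i . (R^h[dy(i, j)] + R^w[dx(i, j)]): the embedding for a
        # query/key pair decomposes into per-axis offset embeddings
        ys = torch.arange(self.ph, device=q.device).repeat_interleave(self.pw)
        xs = torch.arange(self.pw, device=q.device).repeat(self.ph)
        dy = ys[:, None] - ys[None, :] + self.ph - 1            # (N, N) height offsets
        dx = xs[:, None] - xs[None, :] + self.pw - 1            # (N, N) width offsets
        rel = self.rel_h[dy] + self.rel_w[dx]                   # (N, N, C)
        return torch.einsum("bic,ijc->bij", q, rel)             # (B, N, N)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = map(self._pool_tokens, (q, k, v))             # pooled Q, K, V
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = (attn + self._decomposed_rel_bias(q)).softmax(dim=-1)
        out = attn @ v
        out = out + q        # residual pooling connection: add the pooled queries
        return self.proj(out)


# usage: a 14x14 token grid pooled 2x along each axis inside the attention block
tokens = torch.randn(1, 14 * 14, 96)
block = PoolingAttention(dim=96, grid_size=(14, 14), stride=2)
print(block(tokens).shape)   # torch.Size([1, 49, 96])
```
The `out = out + q` line mirrors the residual pooling connection described in the abstract (the pooled query tensor added back to the attention output); the rest is generic attention plumbing.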
Related papers
- ReViT: Enhancing Vision Transformers Feature Diversity with Attention Residual Connections [8.372189962601077]
The self-attention mechanism of the Vision Transformer (ViT) suffers from feature collapse in deeper layers.
We propose a novel residual attention learning method for improving ViT-based architectures.
arXiv Detail & Related papers (2024-02-17T14:44:10Z)
- ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders [104.05133094625137]
We propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer; a hedged sketch of such a layer appears after this list.
This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets.
arXiv Detail & Related papers (2023-01-02T18:59:31Z)
- Global Context Vision Transformers [78.5346173956383]
We propose the global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- Co-training Transformer with Videos and Images Improves Action Recognition [49.160505782802886]
For action recognition, models are typically pretrained on object recognition image datasets such as ImageNet and later finetuned on the target action recognition task with videos.
This approach has achieved good empirical performance especially with recent transformer-based video architectures.
We show how video transformers benefit from joint training on diverse video datasets and label spaces.
arXiv Detail & Related papers (2021-12-14T05:41:39Z)
- Attend and Guide (AG-Net): A Keypoints-driven Attention-based Deep Network for Image Recognition [13.230646408771868]
We propose an end-to-end CNN model that learns meaningful features linking fine-grained changes using our novel attention mechanism.
It captures the spatial structure of images by identifying semantic regions (SRs) and their spatial distributions, which proves to be key to modelling subtle changes in images.
The framework is evaluated on six diverse benchmark datasets.
arXiv Detail & Related papers (2021-10-23T09:43:36Z)
- VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [60.97904439526213]
Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks.
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
arXiv Detail & Related papers (2021-04-22T17:07:41Z)
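For the ConvNeXt V2 entry above, here is the promised hedged sketch of a Global Response Normalization (GRN) layer: channel responses are aggregated globally with an L2 norm over the spatial dimensions, divisively normalized across channels, and used to recalibrate the features. The channels-last layout, the epsilon value, and the exact affine/residual arrangement are assumptions rather than the reference implementation.
```python
# Hedged sketch of a Global Response Normalization (GRN) layer in the spirit of
# ConvNeXt V2, operating on channels-last feature maps of shape (B, H, W, C).
import torch
import torch.nn as nn


class GRN(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(dim))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(dim))    # learnable shift
        self.eps = eps

    def forward(self, x):
        # 1) global aggregation: L2 norm of each channel over the spatial dims
        gx = torch.linalg.vector_norm(x, ord=2, dim=(1, 2), keepdim=True)  # (B, 1, 1, C)
        # 2) divisive normalization of the channel responses
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)               # (B, 1, 1, C)
        # 3) calibration with learnable affine parameters plus an identity shortcut
        return self.gamma * (x * nx) + self.beta + x


# usage on a channels-last feature map
feat = torch.randn(2, 14, 14, 384)
print(GRN(384)(feat).shape)   # torch.Size([2, 14, 14, 384])
```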
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.