MMViT: Multiscale Multiview Vision Transformers
- URL: http://arxiv.org/abs/2305.00104v1
- Date: Fri, 28 Apr 2023 21:51:41 GMT
- Title: MMViT: Multiscale Multiview Vision Transformers
- Authors: Yuchen Liu, Natasha Ong, Kaiyan Peng, Bo Xiong, Qifan Wang, Rui Hou,
Madian Khabsa, Kaiyue Yang, David Liu, Donald S. Williamson, Hanchao Yu
- Abstract summary: We present Multiscale Multiview Vision Transformers (MMViT), which introduces multiscale feature maps and multiview encodings to transformer models.
Our model encodes different views of the input signal and builds several channel-resolution feature stages to process the multiple views of the input at different resolutions in parallel.
We demonstrate the effectiveness of MMViT on audio and image classification tasks, achieving state-of-the-art results.
- Score: 36.93551299085767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Multiscale Multiview Vision Transformers (MMViT), which introduces
multiscale feature maps and multiview encodings to transformer models. Our
model encodes different views of the input signal and builds several
channel-resolution feature stages to process the multiple views of the input at
different resolutions in parallel. At each scale stage, we use a
cross-attention block to fuse information across different views. This enables
the MMViT model to acquire complex high-dimensional representations of the
input at different resolutions. The proposed model can serve as a backbone
model in multiple domains. We demonstrate the effectiveness of MMViT on audio
and image classification tasks, achieving state-of-the-art results.
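The abstract's core mechanism is a cross-attention block that fuses information across views at each scale stage. Below is a minimal sketch of what one such stage could look like, assuming two views and PyTorch; the class name `CrossViewStage`, the strided token pooling, and all dimensions are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossViewStage(nn.Module):
    """One hypothetical scale stage: cross-attend two views, then
    reduce resolution and expand channels for the next stage."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Each view attends to the other view's tokens (cross-attention).
        self.attn_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)
        # Double the channel width for the next channel-resolution stage
        # (one plausible schedule; the paper's schedule may differ).
        self.expand = nn.Linear(dim, 2 * dim)

    def forward(self, view_a: torch.Tensor, view_b: torch.Tensor):
        # view_a, view_b: (batch, tokens, dim), two encodings (views)
        # of the same input signal.
        qa, qb = self.norm_a(view_a), self.norm_b(view_b)
        # Queries come from one view; keys and values from the other,
        # so information flows in both directions.
        fused_a, _ = self.attn_a(qa, qb, qb)
        fused_b, _ = self.attn_b(qb, qa, qa)
        view_a = view_a + fused_a
        view_b = view_b + fused_b
        # Strided token pooling halves the resolution of each view.
        return self.expand(view_a[:, ::2, :]), self.expand(view_b[:, ::2, :])

# Usage: two 196-token views with 96 channels each.
a, b = torch.randn(2, 196, 96), torch.randn(2, 196, 96)
a2, b2 = CrossViewStage(dim=96)(a, b)
print(a2.shape)  # torch.Size([2, 98, 192]): fewer tokens, more channels
```

Stacking several such stages yields the parallel channel-resolution feature hierarchy the abstract describes: the token count shrinks while the channel width grows, and the views exchange information at every scale.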
Related papers
- A Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding [76.44979557843367]
We propose a novel multi-view stereo (MVS) framework that removes the need for a depth-range prior.
We introduce a Multi-view Disparity Attention (MDA) module to aggregate long-range context information.
We explicitly estimate the quality of the current pixel corresponding to sampled points on the epipolar line of the source image.
arXiv Detail & Related papers (2024-11-04T08:50:16Z) - MVTN: A Multiscale Video Transformer Network for Hand Gesture Recognition [5.311735227179715]
We introduce a novel Multiscale Video Transformer Network (MVTN) for dynamic hand gesture recognition.
The proposed model incorporates a multiscale feature hierarchy to capture diverse levels of detail and context within hand gestures.
Experiments show that the proposed MVTN achieves state-of-the-art results with lower computational complexity and fewer parameters.
arXiv Detail & Related papers (2024-09-05T19:55:38Z) - Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z) - Multiview Transformers for Video Recognition [69.50552269271526]
We present Multiview Transformers for Video (MTV), which use separate encoders to represent different views of the input video, with lateral connections fusing information across views.
MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost.
We achieve state-of-the-art results on five standard datasets, and improve even further with large-scale pretraining.
arXiv Detail & Related papers (2022-01-12T03:33:57Z) - MPViT: Multi-Path Vision Transformer for Dense Prediction [43.89623453679854]
Existing Vision Transformers (ViTs) build a simple multi-stage structure for multi-scale representation with single-scale patches, whereas MPViT embeds patches of multiple scales and processes them along parallel paths.
Our MPViTs, scaling from Tiny (5M) to Base (73M), consistently achieve superior performance over state-of-the-art Vision Transformers.
arXiv Detail & Related papers (2021-12-21T06:34:50Z) - MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition [11.573689558780764]
This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition.
To handle the large number of tokens extracted from multiple modalities, we develop several model variants that factorize self-attention across the space, time, and modality dimensions (see the sketch after this list).
Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms the state-of-the-art video transformers in both efficiency and accuracy.
arXiv Detail & Related papers (2021-08-20T18:05:39Z) - Multiscale Vision Transformers [79.76412415996892]
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models.
We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks.
arXiv Detail & Related papers (2021-04-22T17:59:45Z) - Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [81.07894629034767]
This paper presents a new Vision Transformer (ViT) architecture Multi-Scale Vision Longformer.
It significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques.
arXiv Detail & Related papers (2021-03-29T06:23:20Z)
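The MM-ViT entry above mentions factorizing self-attention across the space, time, and modality dimensions. As a rough illustration of that general idea, here is a hedged sketch in PyTorch; the `FactorizedAttention` name, axis ordering, and shapes are hypothetical assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class FactorizedAttention(nn.Module):
    """Apply self-attention separately along the time, space, and
    modality axes instead of over all tokens jointly."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.modal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    @staticmethod
    def _attend(attn: nn.MultiheadAttention, x: torch.Tensor) -> torch.Tensor:
        # Self-attention over the second-to-last axis; all other
        # leading axes are folded into the batch.
        *lead, n, d = x.shape
        flat = x.reshape(-1, n, d)
        out, _ = attn(flat, flat, flat)
        return out.reshape(*lead, n, d) + x  # residual connection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, modalities, time, space, dim)
        # 1) attend over time (fold modality and space into the batch)
        x = self._attend(self.time_attn, x.permute(0, 1, 3, 2, 4)).permute(0, 1, 3, 2, 4)
        # 2) attend over space
        x = self._attend(self.space_attn, x)
        # 3) attend over modality
        x = self._attend(self.modal_attn, x.permute(0, 2, 3, 1, 4)).permute(0, 3, 1, 2, 4)
        return x

tokens = torch.randn(2, 3, 8, 49, 64)  # 3 modalities, 8 frames, 49 patches
print(FactorizedAttention(dim=64)(tokens).shape)  # torch.Size([2, 3, 8, 49, 64])
```

Factorizing this way keeps each attention quadratic only in the length of one axis rather than in the full token count (modalities × time × space), which is why such variants remain tractable on multi-modal video.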
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.