VATT: Transformers for Multimodal Self-Supervised Learning from Raw
Video, Audio and Text
- URL: http://arxiv.org/abs/2104.11178v1
- Date: Thu, 22 Apr 2021 17:07:41 GMT
- Title: VATT: Transformers for Multimodal Self-Supervised Learning from Raw
Video, Audio and Text
- Authors: Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu
Chang, Yin Cui, Boqing Gong
- Abstract summary: Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks.
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance on the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
- Score: 60.97904439526213
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a framework for learning multimodal representations from unlabeled
data using convolution-free Transformer architectures. Specifically, our
Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts
multimodal representations that are rich enough to benefit a variety of
downstream tasks. We train VATT end-to-end from scratch using multimodal
contrastive losses and evaluate its performance on the downstream tasks of
video action recognition, audio event classification, image classification, and
text-to-video retrieval. Furthermore, we study a modality-agnostic
single-backbone Transformer by sharing weights among the three modalities. We
show that the convolution-free VATT outperforms state-of-the-art ConvNet-based
architectures in the downstream tasks. In particular, VATT's vision Transformer
achieves top-1 accuracies of 82.1% on Kinetics-400, 83.6% on Kinetics-600, and
41.1% on Moments in Time, setting new records while avoiding supervised pre-training.
Transferring to image classification leads to 78.7% top-1 accuracy on ImageNet
compared to 64.7% by training the same Transformer from scratch, showing the
generalizability of our model despite the domain gap between videos and images.
VATT's audio Transformer also sets a new record on waveform-based audio event
recognition by achieving an mAP of 39.4% on AudioSet without any supervised
pre-training.
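
The abstract describes training end-to-end with multimodal contrastive losses; the VATT paper pairs an NCE objective between video and audio with a MIL-NCE-style objective between video and text. The sketch below illustrates that combination; the tensor shapes, temperature value, and helper names are illustrative assumptions, not the authors' implementation. In the modality-agnostic single-backbone variant mentioned above, the same shared weights would produce all three embeddings before these losses are applied.

```python
# Hedged sketch of the two contrastive objectives (shapes and names assumed).
import torch
import torch.nn.functional as F


def nce_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE between video and audio embeddings.

    video_emb, audio_emb: (batch, dim) projections into a shared space.
    Matching (video, audio) pairs on the diagonal are positives; every other
    pair in the batch serves as a negative.
    """
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature                        # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def mil_nce_loss(video_emb, text_emb, temperature=0.07):
    """MIL-NCE-style loss: each video is paired with several candidate text
    clips (e.g., nearby ASR captions), any of which may be a true positive.

    video_emb: (batch, dim); text_emb: (batch, num_candidates, dim).
    """
    v = F.normalize(video_emb, dim=-1)                      # (B, D)
    t = F.normalize(text_emb, dim=-1)                       # (B, K, D)
    sim = torch.einsum('bd,nkd->bnk', v, t) / temperature   # (B, B, K)
    idx = torch.arange(sim.size(0), device=sim.device)
    pos = torch.logsumexp(sim[idx, idx], dim=-1)            # own captions only
    all_pairs = torch.logsumexp(sim.flatten(1), dim=-1)     # all captions in batch
    return (all_pairs - pos).mean()


if __name__ == "__main__":
    B, K, D = 8, 4, 512
    video, audio = torch.randn(B, D), torch.randn(B, D)
    text = torch.randn(B, K, D)      # K candidate captions per clip
    total = nce_loss(video, audio) + mil_nce_loss(video, text)
    print(float(total))
```
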
Related papers
- SAVE: Segment Audio-Visual Easy way using Segment Anything Model [0.0]
This study presents a lightweight approach, SAVE, which efficiently adapts the pre-trained Segment Anything Model (SAM) to the audio-visual segmentation (AVS) task.
Our proposed model achieves effective audio-visual fusion and interaction during the encoding stage.
arXiv Detail & Related papers (2024-07-02T07:22:28Z) - Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video
Classification [6.341420717393898]
We develop a novel multiscale audio Transformer (MAT) and a multiscale multimodal Transformer (MMT).
The proposed MAT significantly outperforms AST [28] by 22.2%, 4.4% and 4.7% on three public benchmark datasets.
It is about 3% more efficient based on the number of FLOPs and 9.8% more efficient based on GPU memory usage.
arXiv Detail & Related papers (2024-01-08T17:02:25Z) - Efficient Selective Audio Masked Multimodal Bottleneck Transformer for
Audio-Video Classification [6.341420717393898]
We propose a novel audio-video recognition approach, termed the audio-video Transformer (AVT), to learn from multimodal videos.
For multimodal fusion, simply concatenating tokens in a cross-temporal Transformer requires large computational and memory resources.
We integrate self-supervised objectives, audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space.
arXiv Detail & Related papers (2024-01-08T16:58:59Z) - Zorro: the masked multimodal transformer [68.99684436029884]
Zorro is a technique that uses masks to control how inputs from each modality are routed inside Transformers (a minimal masking sketch appears after this list).
We show that with contrastive pre-training, Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks.
arXiv Detail & Related papers (2023-01-23T17:51:39Z) - SVFormer: Semi-supervised Video Transformer for Action Recognition [88.52042032347173]
We introduce SVFormer, which adopts a steady pseudo-labeling framework to cope with unlabeled video samples.
In addition, we propose a temporal warping augmentation to cover the complex temporal variation in videos.
In particular, SVFormer outperforms the state-of-the-art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400.
arXiv Detail & Related papers (2022-11-23T18:58:42Z) - Co-training Transformer with Videos and Images Improves Action
Recognition [49.160505782802886]
In learning action recognition, models are typically pretrained on image datasets for object recognition, such as ImageNet, and later finetuned on the target action recognition task with videos.
This approach has achieved good empirical performance especially with recent transformer-based video architectures.
We show how video transformers benefit from joint training on diverse video datasets and label spaces.
arXiv Detail & Related papers (2021-12-14T05:41:39Z) - Improved Multiscale Vision Transformers for Classification and Detection [80.64111139883694]
We study Multiscale Vision Transformers (MViT) as a unified architecture for image and video classification, as well as object detection.
We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections.
We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition.
arXiv Detail & Related papers (2021-12-02T18:59:57Z) - Parameter Efficient Multimodal Transformers for Video Representation
Learning [108.8517364784009]
This work focuses on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning.
We show that our approach reduces parameters by up to 80%, allowing us to train our model end-to-end from scratch.
To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
arXiv Detail & Related papers (2020-12-08T00:16:13Z) - Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio-visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture.
We focus on the scene context provided by the visual information to ground the ASR.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)
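
The Zorro entry above describes controlling cross-modal routing with attention masks. A minimal sketch of that idea, assuming fusion tokens may attend to every token while purely unimodal (video-only, audio-only) tokens attend only within their own modality, is given below; the token layout and the helper name are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of a Zorro-style attention mask (token layout assumed).
import torch


def zorro_attention_mask(num_video: int, num_audio: int, num_fusion: int) -> torch.Tensor:
    """Return a (T, T) boolean mask where True means the query may attend to the key."""
    modality = torch.cat([
        torch.zeros(num_video, dtype=torch.long),  # 0 = video tokens
        torch.ones(num_audio, dtype=torch.long),   # 1 = audio tokens
        torch.full((num_fusion,), 2),              # 2 = fusion tokens
    ])
    total = modality.numel()
    same_modality = modality.unsqueeze(0) == modality.unsqueeze(1)
    fusion_query = (modality == 2).unsqueeze(1).expand(total, total)
    # Unimodal queries stay inside their modality (keeping those streams pure);
    # fusion queries may read from all tokens.
    return same_modality | fusion_query


# Example: such a mask can be passed as the attn_mask argument of
# torch.nn.functional.scaled_dot_product_attention.
mask = zorro_attention_mask(num_video=4, num_audio=3, num_fusion=2)
print(mask.int())
```
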