Parameter Efficient Multimodal Transformers for Video Representation
Learning
- URL: http://arxiv.org/abs/2012.04124v1
- Date: Tue, 8 Dec 2020 00:16:13 GMT
- Title: Parameter Efficient Multimodal Transformers for Video Representation
Learning
- Authors: Sangho Lee, Youngjae Yu, Gunhee Kim, Thomas Breuel, Jan Kautz, Yale
Song
- Abstract summary: This work focuses on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning.
We show that our approach reduces parameters by up to 80%, allowing us to train our model end-to-end from scratch.
To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
- Score: 108.8517364784009
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent success of Transformers in the language domain has motivated
adapting them to a multimodal setting, where a new visual model is trained in
tandem with an already pretrained language model. However, due to the excessive
memory requirements of Transformers, existing work typically fixes the
language model and trains only the vision module, which limits its ability to
learn cross-modal information in an end-to-end manner. In this work, we focus
on reducing the parameters of multimodal Transformers in the context of
audio-visual video representation learning. We alleviate the high memory
requirement by sharing the weights of Transformers across layers and
modalities; we decompose the Transformer into modality-specific and
modality-shared parts so that the model learns the dynamics of each modality
both individually and together, and propose a novel parameter sharing scheme
based on low-rank approximation. We show that our approach reduces parameters
up to 80%, allowing us to train our model end-to-end from scratch. We also
propose a negative sampling approach based on an instance similarity measured
on the CNN embedding space that our model learns with the Transformers. To
demonstrate our approach, we pretrain our model on 30-second clips from
Kinetics-700 and transfer it to audio-visual classification tasks.
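As a rough illustration of the parameter-sharing idea described in the abstract, the sketch below (not the authors' code) stores one modality-shared projection that can be reused across Transformer layers and adds a small modality-specific low-rank correction on top; the exact factorization, names, and dimensions are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of low-rank parameter sharing:
# one modality-shared projection is reused across layers, and each modality
# adds a small low-rank correction, W_m = W_shared + U_m @ V_m.
import torch
import torch.nn as nn

class SharedLowRankLinear(nn.Module):
    def __init__(self, dim: int, rank: int, modalities=("audio", "visual")):
        super().__init__()
        # Full-rank weight stored once and shared by all layers and modalities.
        self.shared = nn.Parameter(torch.randn(dim, dim) * dim ** -0.5)
        # Modality-specific low-rank factors (2 * dim * rank parameters each).
        self.U = nn.ParameterDict({m: nn.Parameter(torch.randn(dim, rank) * 0.02)
                                   for m in modalities})
        self.V = nn.ParameterDict({m: nn.Parameter(torch.randn(rank, dim) * 0.02)
                                   for m in modalities})

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        w = self.shared + self.U[modality] @ self.V[modality]  # shared + low-rank term
        return x @ w.t()

proj = SharedLowRankLinear(dim=512, rank=32)
audio_tokens = torch.randn(2, 100, 512)  # (batch, audio patches, dim)
video_tokens = torch.randn(2, 49, 512)   # (batch, video patches, dim)
print(proj(audio_tokens, "audio").shape, proj(video_tokens, "visual").shape)
```

Because the full-rank matrix is stored once and each modality only contributes the small rank-32 factors, most of the weights can be shared across the stack; this is the kind of saving behind the reported 80% reduction, though the paper's actual sharing scheme may differ in its details.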
Related papers
- Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models [6.809572275782338]
We develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model.
Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores.
arXiv Detail & Related papers (2024-03-14T17:59:14Z)
- Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities [56.666806962214565]
We propose to improve transformers of a specific modality with irrelevant data from other modalities.
We use an auxiliary transformer trained with data of another modality and construct pathways to connect components of the two models.
We observe significant and consistent performance improvements with irrelevant data from other modalities.
arXiv Detail & Related papers (2024-01-25T18:59:58Z)
- Dual-path Adaptation from Image to Video Transformers [62.056751480114784]
We efficiently transfer the surpassing representation power of the vision foundation models, such as ViT and Swin, for video understanding with only a few trainable parameters.
We propose a novel DualPath adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block.
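For reference, a lightweight bottleneck adapter of the kind mentioned here usually amounts to a down-projection, nonlinearity, and up-projection wrapped in a residual connection; the sketch below is a generic, assumed form (bottleneck width, activation, and zero initialization are illustrative), not the DualPath authors' exact module.

```python
# Generic bottleneck adapter sketch; sizes and initialization are assumptions,
# not the DualPath paper's exact design.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project tokens to a narrow bottleneck
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)    # project back to the model width
        nn.init.zeros_(self.up.weight)          # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual keeps the frozen backbone's features intact at initialization.
        return x + self.up(self.act(self.down(x)))

adapter = BottleneckAdapter(dim=768)
tokens = torch.randn(2, 197, 768)  # (batch, patches, dim) from a frozen ViT block
print(adapter(tokens).shape)       # torch.Size([2, 197, 768])
```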
arXiv Detail & Related papers (2023-03-17T09:37:07Z)
- All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released in https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z)
- Efficient Training of Audio Transformers with Patchout [7.073210405344709]
We propose a novel method to optimize and regularize transformers on audio spectrograms.
The proposed models achieve a new state-of-the-art performance on Audioset and can be trained on a single consumer-grade GPU.
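As we read the title, "Patchout" drops a random subset of the spectrogram patch tokens during training, which both regularizes the model and shortens the input sequence; the sketch below is a hedged, generic version of that idea (keep ratio and token layout are assumptions), not the authors' implementation.

```python
# Generic patch-dropout sketch in the spirit of Patchout; the keep ratio and
# token layout are illustrative assumptions.
import torch

def patchout(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """tokens: (batch, num_patches, dim) spectrogram patch embeddings."""
    b, n, d = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    # Pick a different random subset of patches for every sample in the batch.
    idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))

spec_tokens = torch.randn(4, 512, 768)  # (batch, patches, dim)
print(patchout(spec_tokens).shape)      # torch.Size([4, 256, 768])
```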
arXiv Detail & Related papers (2021-10-11T08:07:50Z)
- Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the great success of transformers and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization.
To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer.
Practically, extensive experiments show that HMT surpasses most of the traditional, RNN-based and attention-based video summarization methods.
arXiv Detail & Related papers (2021-09-22T07:38:59Z)
- Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
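A common way to realize such a clip-level objective is an InfoNCE loss that matches a short clip's embedding to the embedding of a longer clip from the same video, using the other videos in the batch as negatives; the sketch below shows this standard formulation under assumed encoders and temperature, not the paper's exact loss.

```python
# Generic InfoNCE sketch for matching short-clip and long-clip embeddings;
# the temperature and the assumption that row i of both tensors comes from
# the same video are illustrative.
import torch
import torch.nn.functional as F

def long_short_contrastive_loss(short_emb: torch.Tensor,
                                long_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    short_emb = F.normalize(short_emb, dim=-1)
    long_emb = F.normalize(long_emb, dim=-1)
    logits = short_emb @ long_emb.t() / temperature       # (batch, batch) similarities
    targets = torch.arange(short_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)               # diagonal pairs are positives

loss = long_short_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```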
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
- Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers [16.88840622945725]
We develop the Subformer, a parameter efficient Transformer-based model.
Experiments on machine translation, abstractive summarization, and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters.
arXiv Detail & Related papers (2021-01-01T13:53:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.