MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition
- URL: http://arxiv.org/abs/2108.09322v1
- Date: Fri, 20 Aug 2021 18:05:39 GMT
- Title: MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition
- Authors: Jiawei Chen, Chiu Man Ho
- Abstract summary: This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition.
In order to handle the large number of tokens extracted from multiple modalities, we develop several model variants which factorize self-attention across the space, time and modality dimensions.
Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms the state-of-the-art video transformers in both efficiency and accuracy.
- Score: 11.573689558780764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a pure transformer-based approach, dubbed the Multi-Modal
Video Transformer (MM-ViT), for video action recognition. Different from other
schemes which solely utilize the decoded RGB frames, MM-ViT operates
exclusively in the compressed video domain and exploits all readily available
modalities, i.e., I-frames, motion vectors, residuals and audio waveform. In
order to handle the large number of spatiotemporal tokens extracted from
multiple modalities, we develop several scalable model variants which factorize
self-attention across the space, time and modality dimensions. In addition, to
further explore the rich inter-modal interactions and their effects, we develop
and compare three distinct cross-modal attention mechanisms that can be
seamlessly integrated into the transformer building block. Extensive
experiments on three public action recognition benchmarks (UCF-101,
Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms the
state-of-the-art video transformers in both efficiency and accuracy, and
performs on par with or better than state-of-the-art CNN counterparts that rely on
computationally heavy optical flow.
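As a rough illustration of the factorization idea described in the abstract, the sketch below applies self-attention separately along the space, time and modality axes of a multi-modal token grid rather than jointly over all tokens. The (B, M, T, S, D) tensor layout, module names and the sequential ordering of the three attention stages are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of factorized self-attention over space, time and modality.
# The (B, M, T, S, D) layout and the ordering of the three attention stages
# are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class FactorizedMultiModalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One self-attention block per axis: space, time, modality.
        self.space_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.modal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    @staticmethod
    def _attend(attn, x):
        # x: (batch-like, axis_len, dim); self-attend along the middle axis.
        out, _ = attn(x, x, x)
        return x + out  # residual connection

    def forward(self, x):
        # x: (B, M, T, S, D) = batch, modality, time, space, channels.
        B, M, T, S, D = x.shape

        # 1) Spatial attention: tokens within one frame of one modality interact.
        x = self._attend(self.space_attn, x.reshape(B * M * T, S, D)).reshape(B, M, T, S, D)

        # 2) Temporal attention: one spatial location of one modality across time.
        x = x.permute(0, 1, 3, 2, 4).reshape(B * M * S, T, D)
        x = self._attend(self.time_attn, x).reshape(B, M, S, T, D).permute(0, 1, 3, 2, 4)

        # 3) Cross-modal attention: one space-time position across modalities.
        x = x.permute(0, 2, 3, 1, 4).reshape(B * T * S, M, D)
        x = self._attend(self.modal_attn, x).reshape(B, T, S, M, D).permute(0, 3, 1, 2, 4)
        return x


if __name__ == "__main__":
    # Toy input: 2 clips, 3 modalities, 4 time steps, 16 spatial tokens, 64-d embeddings.
    tokens = torch.randn(2, 3, 4, 16, 64)
    layer = FactorizedMultiModalAttention(dim=64, num_heads=8)
    print(layer(tokens).shape)  # torch.Size([2, 3, 4, 16, 64])
```

Factorizing attention this way keeps each attention call quadratic only in the length of a single axis rather than in the full multi-modal token count, which is what makes the large token set tractable.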
Related papers
- MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers [41.54004590821323]
We propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment for multimodal semantic features.
Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer.
Unlike prior work that only aligns coarse features from the output of unimodal encoders, we introduce blockwise contrastive learning to align coarse-to-fine-grain hierarchical features.
arXiv Detail & Related papers (2024-06-07T13:35:44Z)
- Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification [6.341420717393898]
We propose a novel audio-video recognition approach termed audio video Transformer, AVT, to learn from multimodal videos.
For multimodal fusion, simply concatenating tokens in a cross-temporal Transformer requires large computational and memory resources.
We integrate self-supervised objectives, audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space.
arXiv Detail & Related papers (2024-01-08T16:58:59Z)
- VDT: General-purpose Video Diffusion Transformers via Mask Modeling [62.71878864360634]
Video Diffusion Transformer (VDT) pioneers the use of transformers in diffusion-based video generation.
We propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios.
arXiv Detail & Related papers (2023-05-22T17:59:45Z)
- MMViT: Multiscale Multiview Vision Transformers [36.93551299085767]
We present Multiscale Multiview Vision Transformers (MMViT), which introduces multiscale feature maps and multiview encodings to transformer models.
Our model encodes different views of the input signal and builds several channel-resolution feature stages to process the multiple views of the input at different resolutions in parallel.
We demonstrate the effectiveness of MMViT on audio and image classification tasks, achieving state-of-the-art results.
arXiv Detail & Related papers (2023-04-28T21:51:41Z)
- MED-VT++: Unifying Multimodal Learning with a Multiscale Encoder-Decoder Video Transformer [12.544216587327387]
We present an end-to-end trainable unified multiscale encoder-decoder transformer that is focused on dense prediction tasks in video.
The presented Multiscale Encoder-Decoder Video Transformer (MED-VT) uses multiscale representation throughout and supports an optional input modality beyond video.
We present a transductive learning scheme through many-to-many label propagation to provide temporally consistent video predictions.
arXiv Detail & Related papers (2023-04-12T15:50:19Z)
- Dual-path Adaptation from Image to Video Transformers [62.056751480114784]
We efficiently transfer the strong representation power of vision foundation models, such as ViT and Swin, to video understanding with only a few trainable parameters.
We propose a novel DualPath adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block.
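As a rough sketch of the bottleneck-adapter idea mentioned above (not the DualPath paper's exact configuration), a minimal PyTorch version looks like the following; the reduction factor and zero-initialization are assumptions.

```python
# Minimal sketch of a bottleneck adapter: down-project, nonlinearity, up-project,
# residual. Reduction factor and zero-init are illustrative assumptions, not the
# DualPath paper's exact design.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        hidden = dim // reduction
        self.down = nn.Linear(dim, hidden)   # only a few trainable parameters
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)
        nn.init.zeros_(self.up.weight)       # adapter starts as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        # x: (batch, tokens, dim); the frozen backbone features pass through
        # unchanged at initialization, and the adapter learns a small residual.
        return x + self.up(self.act(self.down(x)))


# Toy usage: adapt 768-d ViT tokens with a 4x bottleneck.
tokens = torch.randn(2, 197, 768)
print(BottleneckAdapter(768)(tokens).shape)  # torch.Size([2, 197, 768])
```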
arXiv Detail & Related papers (2023-03-17T09:37:07Z)
- Zorro: the masked multimodal transformer [68.99684436029884]
Zorro is a technique that uses masks to control how inputs from each modality are routed inside Transformers.
We show that with contrastive pre-training Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks.
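To make the routing-by-masking idea concrete, here is a minimal, hypothetical sketch of a block-structured attention mask in which audio and video tokens stay unimodal while a small set of fusion tokens attends to everything; the token ordering and boolean mask convention are assumptions rather than Zorro's exact recipe.

```python
# Minimal sketch of a Zorro-style attention mask: audio tokens attend only to audio,
# video tokens only to video, and fusion tokens attend to all tokens.
# Token ordering [audio | video | fusion] and the boolean convention
# (True = blocked, as in nn.MultiheadAttention's attn_mask) are assumptions.
import torch


def modality_routing_mask(n_audio: int, n_video: int, n_fusion: int) -> torch.Tensor:
    n = n_audio + n_video + n_fusion
    mask = torch.ones(n, n, dtype=torch.bool)  # start with everything blocked

    a = slice(0, n_audio)
    v = slice(n_audio, n_audio + n_video)
    f = slice(n_audio + n_video, n)

    mask[a, a] = False  # audio tokens see only audio
    mask[v, v] = False  # video tokens see only video
    mask[f, :] = False  # fusion tokens see all modalities
    return mask


# Toy example: 4 audio tokens, 6 video tokens, 2 fusion tokens.
print(modality_routing_mask(4, 6, 2).int())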
arXiv Detail & Related papers (2023-01-23T17:51:39Z)
- MAGVIT: Masked Generative Video Transformer [129.50814875955444]
We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model.
A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains.
arXiv Detail & Related papers (2022-12-10T04:26:32Z)
- Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the success of transformers and the natural hierarchical structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization.
To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer.
In practice, extensive experiments show that the proposed Hierarchical Multimodal Transformer (HMT) surpasses most traditional, RNN-based and attention-based video summarization methods.
arXiv Detail & Related papers (2021-09-22T07:38:59Z)
- Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z)