Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video
Classification
- URL: http://arxiv.org/abs/2401.04023v1
- Date: Mon, 8 Jan 2024 17:02:25 GMT
- Title: Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video
Classification
- Authors: Wentao Zhu
- Abstract summary: We develop a multiscale multimodal Transformer (MMT) composed of a novel multiscale audio Transformer (MAT) and a multiscale video Transformer.
The proposed MAT significantly outperforms AST [28] by 22.2%, 4.4% and 4.7% on three public benchmark datasets.
It is also about 3% more efficient in terms of FLOPs and 9.8% more efficient in terms of GPU memory usage.
- Score: 6.341420717393898
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, researchers have combined audio and video signals to deal
with challenges where actions are not well represented or captured by visual
cues alone. However, how to effectively leverage the two modalities is still under
development. In this work, we develop a multiscale multimodal Transformer (MMT)
that leverages hierarchical representation learning. In particular, MMT is
composed of a novel multiscale audio Transformer (MAT) and a multiscale video
Transformer [43]. To learn a discriminative cross-modality fusion, we further
design multimodal supervised contrastive objectives called audio-video
contrastive loss (AVC) and intra-modal contrastive loss (IMC) that robustly
align the two modalities. MMT surpasses previous state-of-the-art approaches by
7.3% and 2.1% on Kinetics-Sounds and VGGSound in terms of the top-1 accuracy
without external training data. Moreover, the proposed MAT significantly
outperforms AST [28] by 22.2%, 4.4% and 4.7% on three public benchmark
datasets, and is about 3% more efficient based on the number of FLOPs and 9.8%
more efficient based on GPU memory usage.
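The abstract names two supervised contrastive objectives, the audio-video contrastive loss (AVC) and the intra-modal contrastive loss (IMC), but does not give their equations. Below is a minimal PyTorch sketch of one common way to formulate such objectives (an InfoNCE-style supervised contrastive loss over class labels); the function names, the temperature, and the exact masking and symmetrization are illustrative assumptions, not the paper's definitions.

    import torch
    import torch.nn.functional as F

    def supervised_contrastive(anchor, contrast, labels, temperature=0.1, exclude_self=False):
        """Supervised contrastive loss: for each row of `anchor`, the positives are
        all rows of `contrast` that share its class label (illustrative formulation)."""
        anchor = F.normalize(anchor, dim=-1)
        contrast = F.normalize(contrast, dim=-1)
        logits = anchor @ contrast.t() / temperature                      # (B, B) similarities
        pos_mask = (labels.unsqueeze(1) == labels.unsqueeze(0)).float()   # same-label pairs
        if exclude_self:  # drop i == i pairs when a modality is contrasted with itself
            eye = torch.eye(len(anchor), device=anchor.device)
            pos_mask = pos_mask * (1.0 - eye)
            logits = logits - eye * 1e9                                   # remove self-pairs from the softmax
        log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
        loss = -(pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1.0)
        return loss.mean()

    def avc_imc_losses(audio_emb, video_emb, labels, temperature=0.1):
        """AVC: cross-modal (audio <-> video) contrast; IMC: intra-modal contrast."""
        avc = 0.5 * (supervised_contrastive(audio_emb, video_emb, labels, temperature)
                     + supervised_contrastive(video_emb, audio_emb, labels, temperature))
        imc = 0.5 * (supervised_contrastive(audio_emb, audio_emb, labels, temperature, exclude_self=True)
                     + supervised_contrastive(video_emb, video_emb, labels, temperature, exclude_self=True))
        return avc, imc

    # Toy usage with random projected embeddings from the audio and video branches
    audio_emb = torch.randn(8, 256)              # e.g., pooled MAT features after a projection head
    video_emb = torch.randn(8, 256)              # e.g., pooled video-branch features
    labels = torch.randint(0, 4, (8,))           # class labels of the 8 clips
    avc, imc = avc_imc_losses(audio_emb, video_emb, labels)

In practice, these terms would be added to the classification loss with weights tuned on a validation set.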
Related papers
- MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers [41.54004590821323]
We propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment for multimodal semantic features.
Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer.
Unlike prior work that only aligns coarse features from the output of unimodal encoders, we introduce blockwise contrastive learning to align coarse-to-fine-grain hierarchical features.
arXiv Detail & Related papers (2024-06-07T13:35:44Z)
- Improving Multimodal Learning with Multi-Loss Gradient Modulation [3.082715511775795]
We improve upon previous work by introducing a multi-loss objective and further refining the balancing process.
We achieve superior results across three audio-video datasets: on CREMA-D, models with ResNet encoder backbones surpass the previous best by 1.9% to 12.4%.
arXiv Detail & Related papers (2024-05-13T17:01:28Z)
- Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification [6.341420717393898]
We propose a novel audio-video recognition approach, termed the audio-video Transformer (AVT), to learn from multimodal videos.
For multimodal fusion, simply concatenating tokens in a cross-temporal Transformer requires large computational and memory resources (see the bottleneck-fusion sketch after this entry).
We integrate self-supervised objectives, audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space.
arXiv Detail & Related papers (2024-01-08T16:58:59Z)
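Neither the MMT abstract above nor the AVT summary spells out the bottleneck mechanism, but a common way to avoid concatenating the full audio and video token sequences is to let the two streams exchange information only through a small set of shared bottleneck tokens. The PyTorch sketch below illustrates that idea under that assumption; the class name, token counts, and the averaging rule for the bottleneck are illustrative, not the papers' actual design.

    import torch
    import torch.nn as nn

    class BottleneckFusionLayer(nn.Module):
        """One fusion layer: each modality attends over its own tokens plus a small
        set of shared bottleneck tokens, and the two updated bottleneck copies are averaged."""
        def __init__(self, dim=256, num_heads=4):
            super().__init__()
            self.audio_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
            self.video_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)

        def forward(self, audio_tokens, video_tokens, bottleneck):
            n = bottleneck.shape[1]
            a = self.audio_layer(torch.cat([audio_tokens, bottleneck], dim=1))
            v = self.video_layer(torch.cat([video_tokens, bottleneck], dim=1))
            audio_tokens, bn_a = a[:, :-n], a[:, -n:]
            video_tokens, bn_v = v[:, :-n], v[:, -n:]
            bottleneck = 0.5 * (bn_a + bn_v)   # cross-modal information flows only through the bottleneck
            return audio_tokens, video_tokens, bottleneck

    # Toy usage: 4 shared bottleneck tokens fuse 100 audio tokens with 196 video tokens
    layer = BottleneckFusionLayer(dim=256, num_heads=4)
    audio = torch.randn(2, 100, 256)
    video = torch.randn(2, 196, 256)
    bottleneck = torch.randn(2, 4, 256)        # in practice a learnable nn.Parameter expanded per batch
    audio, video, bottleneck = layer(audio, video, bottleneck)

Because cross-modal interaction is funneled through only a few bottleneck tokens, the quadratic attention cost is paid per modality rather than over the concatenated audio-video sequence.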
- Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations, we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
- MAST: Multiscale Audio Spectrogram Transformers [53.06337011259031]
We present the Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST); see the multiscale-stage sketch after this entry.
In practice, MAST significantly outperforms AST by an average accuracy of 3.4% across 8 speech and non-speech tasks from the LAPE Benchmark.
arXiv Detail & Related papers (2022-11-02T23:34:12Z)
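The MAST entry above brings multiscale feature hierarchies to the Audio Spectrogram Transformer but gives no architectural detail. The sketch below shows a simplified multiscale stage that pools tokens along time and widens the embedding between Transformer stages; it is a stand-in for hierarchy-building in general, and the real MAST/MViT-style design (e.g., pooling inside the attention blocks) may differ.

    import torch
    import torch.nn as nn

    class MultiscaleStage(nn.Module):
        """One stage of a multiscale spectrogram Transformer: attention at the current
        resolution, then temporal pooling and channel expansion before the next stage."""
        def __init__(self, dim, num_heads, pool=2, out_dim=None):
            super().__init__()
            out_dim = out_dim or dim * 2
            self.encoder = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
            self.pool = nn.AvgPool1d(kernel_size=pool, stride=pool)    # halve the token count
            self.proj = nn.Linear(dim, out_dim)                        # widen the embedding

        def forward(self, tokens):                                     # tokens: (batch, time, dim)
            tokens = self.encoder(tokens)
            tokens = self.pool(tokens.transpose(1, 2)).transpose(1, 2)
            return self.proj(tokens)

    # Toy usage: 128 spectrogram patch tokens -> 64 -> 32 tokens with growing width
    x = torch.randn(2, 128, 96)
    x = MultiscaleStage(96, 4)(x)     # (2, 64, 192)
    x = MultiscaleStage(192, 4)(x)    # (2, 32, 384)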
- Multimodal Transformer Distillation for Audio-Visual Synchronization [53.237653873618754]
This paper proposes an MTDVocaLiST model, which is trained with our proposed multimodal Transformer distillation (MTD) loss.
MTDVocaLiST reduces the model size of VocaLiST by 83.52% while still maintaining similar performance.
arXiv Detail & Related papers (2022-10-27T15:53:38Z)
- A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS [52.51848317549301]
We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance neural TTS synthesis.
A vector-quantized variational autoencoder (VQ-VAE) based feature analyzer is used to encode Mel spectrograms of speech training data.
In synthesis, the neural vocoder converts the predicted multi-stage multi-codebook representations (MSMCRs) into final speech waveforms.
arXiv Detail & Related papers (2022-09-22T09:43:17Z)
- VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [60.97904439526213]
Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks.
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
arXiv Detail & Related papers (2021-04-22T17:07:41Z)
- End-to-End Multi-speaker Speech Recognition with Transformer [88.22355110349933]
We replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture.
We also modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation; see the segment-restricted attention sketch after this entry.
arXiv Detail & Related papers (2020-02-10T16:29:26Z)
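The last entry above restricts self-attention to a segment rather than the whole sequence. The PyTorch sketch below shows one way to realize this with a block-diagonal mask over fixed-length, non-overlapping segments; the helper names and the segment_len parameter are illustrative assumptions, and the paper's actual restriction scheme may differ.

    import torch
    import torch.nn.functional as F

    def segment_mask(seq_len, segment_len, device=None):
        """Boolean (seq_len, seq_len) mask that is True only between positions
        falling in the same fixed-length segment."""
        seg_id = torch.arange(seq_len, device=device) // segment_len
        return seg_id.unsqueeze(0) == seg_id.unsqueeze(1)

    def segment_restricted_attention(q, k, v, segment_len):
        """Scaled dot-product attention in which each position attends only
        to positions inside its own segment."""
        seq_len, dim = q.shape[-2], q.shape[-1]
        scores = q @ k.transpose(-2, -1) / dim ** 0.5                # (..., seq_len, seq_len)
        allowed = segment_mask(seq_len, segment_len, device=q.device)
        scores = scores.masked_fill(~allowed, float("-inf"))         # block cross-segment attention
        return F.softmax(scores, dim=-1) @ v

    # Toy usage: a 12-step acoustic feature sequence attended in segments of 4 steps
    q = k = v = torch.randn(2, 12, 64)                               # (batch, time, dim)
    out = segment_restricted_attention(q, k, v, segment_len=4)
    print(out.shape)                                                 # torch.Size([2, 12, 64])

The dense mask above keeps the full attention matrix for clarity; an efficient implementation would batch the segments so the cost scales with the segment length rather than the full sequence length.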
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.