Efficient Selective Audio Masked Multimodal Bottleneck Transformer for
Audio-Video Classification
- URL: http://arxiv.org/abs/2401.04154v1
- Date: Mon, 8 Jan 2024 16:58:59 GMT
- Title: Efficient Selective Audio Masked Multimodal Bottleneck Transformer for
Audio-Video Classification
- Authors: Wentao Zhu
- Abstract summary: We propose a novel audio-video recognition approach termed audio video Transformer, AVT, to learn from multimodal videos.
For multimodal fusion, simply concatenating tokens in a cross-modal Transformer requires large computational and memory resources.
We integrate self-supervised objectives, audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space.
- Score: 6.341420717393898
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio and video are the two most common modalities on mainstream media
platforms, e.g., YouTube. To learn from multimodal videos effectively, in this
work, we propose a novel audio-video recognition approach termed audio video
Transformer, AVT, leveraging the effective spatio-temporal representation by
the video Transformer to improve action recognition accuracy. For multimodal
fusion, simply concatenating multimodal tokens in a cross-modal Transformer
requires large computational and memory resources; instead, we reduce the
cross-modality complexity through an audio-video bottleneck Transformer. To
improve the learning efficiency of the multimodal Transformer, we integrate
self-supervised objectives, i.e., audio-video contrastive learning, audio-video
matching, and masked audio and video learning, into AVT training, which maps
diverse audio and video representations into a common multimodal representation
space. We further propose a masked audio segment loss to learn semantic audio
activities in AVT. Extensive experiments and ablation studies on three public
datasets and two in-house datasets consistently demonstrate the effectiveness
of the proposed AVT. Specifically, AVT outperforms its previous
state-of-the-art counterparts on Kinetics-Sounds by 8%. AVT also surpasses one
of the previous state-of-the-art video Transformers [25] by 10% on VGGSound by
leveraging the audio signal. Compared to one of the previous state-of-the-art
multimodal methods, MBT [32], AVT is 1.3% more efficient in terms of FLOPs and
improves the accuracy by 3.8% on Epic-Kitchens-100.
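For intuition, the sketch below illustrates the bottleneck fusion described above: audio and video tokens exchange information only through a small set of shared bottleneck tokens instead of attending over the full concatenation of both modalities. This is a minimal PyTorch sketch in the spirit of MBT-style bottlenecks; the layer types, token counts, dimensions, and the averaging update of the bottleneck are illustrative assumptions, not the exact AVT architecture.

```python
# Minimal sketch of audio-video bottleneck fusion (MBT-style); the layer types,
# dimensions, and averaging update are illustrative assumptions, not AVT's
# exact architecture.
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.audio_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, audio_tokens, video_tokens, bottleneck):
        n_a, n_v = audio_tokens.size(1), video_tokens.size(1)
        # Each modality attends only to its own tokens plus the shared
        # bottleneck tokens; no full audio x video attention is computed.
        a = self.audio_layer(torch.cat([audio_tokens, bottleneck], dim=1))
        v = self.video_layer(torch.cat([video_tokens, bottleneck], dim=1))
        audio_out, audio_btl = a[:, :n_a], a[:, n_a:]
        video_out, video_btl = v[:, :n_v], v[:, n_v:]
        # Cross-modal information flows only through the bottleneck, here
        # updated as the average of the two modality-specific copies.
        return audio_out, video_out, 0.5 * (audio_btl + video_btl)

# Example: 64 audio tokens and 196 video tokens fused through 4 bottleneck tokens.
audio, video = torch.randn(2, 64, 256), torch.randn(2, 196, 256)
bottleneck = torch.randn(2, 4, 256)
audio, video, bottleneck = BottleneckFusionLayer()(audio, video, bottleneck)
```

Because each modality attends to at most its own tokens plus a handful of bottleneck tokens, the attention cost grows with the per-modality sequence length rather than with the combined audio-video sequence length.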
Related papers
- MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers [41.54004590821323]
We propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment for multimodal semantic features.
Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer.
Unlike prior work that only aligns coarse features from the output of unimodal encoders, we introduce blockwise contrastive learning to align coarse-to-fine-grained hierarchical features.
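As a rough illustration of the contrastive alignment used both here and in AVT's audio-video contrastive objective, the sketch below implements a symmetric InfoNCE-style loss over paired audio and video embeddings; the pooling into a single embedding per clip and the temperature value are assumptions for illustration, not either paper's exact formulation.

```python
# Minimal sketch of a symmetric audio-video contrastive (InfoNCE) loss; the
# pooling into one embedding per clip and the temperature are assumptions.
import torch
import torch.nn.functional as F

def av_contrastive_loss(audio_feats, video_feats, temperature=0.07):
    # audio_feats, video_feats: (batch, dim); row i of each tensor comes from
    # the same clip (a positive pair), all other rows act as negatives.
    a = F.normalize(audio_feats, dim=-1)
    v = F.normalize(video_feats, dim=-1)
    logits = a @ v.t() / temperature                # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example: 8 paired audio/video clip embeddings of dimension 256.
loss = av_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```

In a blockwise setting, the same loss would be applied to pooled features from each Transformer block rather than only to the final outputs.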
arXiv Detail & Related papers (2024-06-07T13:35:44Z)
- Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification [6.341420717393898]
We develop a novel multiscale audio Transformer (MAT) and a multiscale video Transformer (MMT).
The proposed MAT significantly outperforms AST [28] by 22.2%, 4.4% and 4.7% on three public benchmark datasets.
It is about 3% more efficient in terms of FLOPs and 9.8% more efficient in terms of GPU memory usage.
arXiv Detail & Related papers (2024-01-08T17:02:25Z)
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
- Zorro: the masked multimodal transformer [68.99684436029884]
Zorro is a technique that uses masks to control how inputs from each modality are routed inside Transformers.
We show that, with contrastive pre-training, Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks.
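A minimal sketch of the masking idea, assuming a single joint token sequence of audio, video, and fusion tokens: unimodal tokens may attend only within their own modality, while fusion tokens may attend everywhere. The token counts and the boolean-mask convention below are illustrative assumptions rather than Zorro's exact implementation.

```python
# Sketch of a Zorro-style attention mask: audio and video tokens attend only
# within their own modality, while fusion tokens attend to everything.
# Token counts and the mask convention are illustrative assumptions.
import torch

def zorro_attention_mask(n_audio, n_video, n_fusion):
    n = n_audio + n_video + n_fusion
    allowed = torch.zeros(n, n, dtype=torch.bool)
    a = slice(0, n_audio)
    v = slice(n_audio, n_audio + n_video)
    f = slice(n_audio + n_video, n)
    allowed[a, a] = True   # audio attends to audio
    allowed[v, v] = True   # video attends to video
    allowed[f, :] = True   # fusion attends to all tokens
    # torch.nn.MultiheadAttention treats True in attn_mask as "blocked",
    # so return the complement of the allowed pattern.
    return ~allowed

mask = zorro_attention_mask(n_audio=4, n_video=6, n_fusion=2)
```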
arXiv Detail & Related papers (2023-01-23T17:51:39Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
By design, MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process.
Experiments show superior results in unconditional audio-video generation and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
- Multimodal Transformer Distillation for Audio-Visual Synchronization [53.237653873618754]
This paper proposes the MTDVocaLiST model, which is trained with our proposed multimodal Transformer distillation (MTD) loss.
MTDVocaLiST reduces the model size of VocaLiST by 83.52% while still maintaining similar performance.
arXiv Detail & Related papers (2022-10-27T15:53:38Z)
- MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition [11.573689558780764]
This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition.
In order to handle the large number of tokens extracted from multiple modalities, we develop several model variants which factorize self-attention across the space, time and modality dimensions; one such factorization is sketched below.
Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms the state-of-the-art video transformers in both efficiency and accuracy.
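The following is a minimal sketch of one way such factorization can look, assuming tokens arranged on a (time, space, modality) grid so that self-attention is applied along one axis at a time; the axis ordering, residual connections, and layer sizes are illustrative assumptions, not MM-ViT's exact model variants.

```python
# Sketch of factorized self-attention over a (time, space, modality) token grid;
# the axis order, residuals, and sizes are illustrative assumptions, not
# MM-ViT's exact model variants.
import torch
import torch.nn as nn

class FactorizedAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.modal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def _attend(self, attn, x):
        out, _ = attn(x, x, x)
        return x + out  # residual connection

    def forward(self, x):
        # x: (batch, time, space, modality, dim); attend along one axis at a
        # time instead of over all time*space*modality tokens jointly.
        b, t, s, m, d = x.shape
        xt = x.permute(0, 2, 3, 1, 4).reshape(b * s * m, t, d)
        x = self._attend(self.time_attn, xt).reshape(b, s, m, t, d).permute(0, 3, 1, 2, 4)
        xs = x.permute(0, 1, 3, 2, 4).reshape(b * t * m, s, d)
        x = self._attend(self.space_attn, xs).reshape(b, t, m, s, d).permute(0, 1, 3, 2, 4)
        xm = x.reshape(b * t * s, m, d)
        x = self._attend(self.modal_attn, xm).reshape(b, t, s, m, d)
        return x

# Example: 2 clips, 4 frames, 16 spatial patches, 3 modality streams, dim 128.
y = FactorizedAttention()(torch.randn(2, 4, 16, 3, 128))
```

Attending over each axis separately keeps the cost of each attention quadratic only in that axis's length, rather than quadratic in the full time x space x modality token count.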
arXiv Detail & Related papers (2021-08-20T18:05:39Z)
- VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [60.97904439526213]
Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks.
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
arXiv Detail & Related papers (2021-04-22T17:07:41Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio-visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture.
We focus on the scene context provided by the visual information to ground the ASR.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)