Attention Bottlenecks for Multimodal Fusion
- URL: http://arxiv.org/abs/2107.00135v1
- Date: Wed, 30 Jun 2021 22:44:12 GMT
- Title: Attention Bottlenecks for Multimodal Fusion
- Authors: Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid
and Chen Sun
- Abstract summary: Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses `fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
- Score: 90.75885715478054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans perceive the world by concurrently processing and fusing
high-dimensional inputs from multiple modalities such as vision and audio.
Machine perception models, in stark contrast, are typically modality-specific
and optimised for unimodal benchmarks, and hence late-stage fusion of final
representations or predictions from each modality (`late-fusion') is still a
dominant paradigm for multimodal video classification. Instead, we introduce a
novel transformer based architecture that uses `fusion bottlenecks' for
modality fusion at multiple layers. Compared to traditional pairwise
self-attention, our model forces information between different modalities to
pass through a small number of bottleneck latents, requiring the model to
collate and condense the most relevant information in each modality and only
share what is necessary. We find that such a strategy improves fusion
performance, at the same time reducing computational cost. We conduct thorough
ablation studies, and achieve state-of-the-art results on multiple audio-visual
classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All
code and models will be released.
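To make the bottleneck idea concrete, below is a minimal sketch of bottleneck-style fusion between an audio and a video token stream, written in PyTorch for illustration only. The layer sizes, the number of bottleneck tokens, the use of standard nn.TransformerEncoderLayer blocks, and the averaging update of the bottleneck tokens are assumptions made for this sketch, not the authors' released implementation.
```python
# Hedged sketch of fusion through a small set of bottleneck latents.
# Hyperparameters and the bottleneck update rule are assumptions.
import torch
import torch.nn as nn


class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, n_bottlenecks: int = 4):
        super().__init__()
        # One standard transformer encoder layer per modality.
        self.audio_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.n_bottlenecks = n_bottlenecks

    def forward(self, audio, video, bottlenecks):
        b = self.n_bottlenecks
        # Each modality attends only to its own tokens plus the shared
        # bottleneck tokens; there is no direct audio<->video attention.
        audio_out = self.audio_layer(torch.cat([audio, bottlenecks], dim=1))
        video_out = self.video_layer(torch.cat([video, bottlenecks], dim=1))
        audio, audio_bn = audio_out[:, :-b], audio_out[:, -b:]
        video, video_bn = video_out[:, :-b], video_out[:, -b:]
        # Merge the two bottleneck views so cross-modal information is
        # exchanged only through these few latents (averaging is an
        # assumption for this sketch).
        bottlenecks = 0.5 * (audio_bn + video_bn)
        return audio, video, bottlenecks


if __name__ == "__main__":
    layer = BottleneckFusionLayer()
    audio = torch.randn(2, 98, 256)   # e.g. audio spectrogram patch tokens
    video = torch.randn(2, 196, 256)  # e.g. video RGB patch tokens
    bn = torch.randn(2, 4, 256)       # small set of fusion bottleneck tokens
    audio, video, bn = layer(audio, video, bn)
    print(audio.shape, video.shape, bn.shape)
```
Stacking such layers after a few unimodal layers gives a mid-fusion model in which the bottleneck tokens are the only channel for cross-modal exchange, which is what reduces the attention cost relative to full pairwise self-attention.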
Related papers
- Fine-Grained Scene Image Classification with Modality-Agnostic Adapter [8.801601759337006]
We present a new multi-modal feature fusion approach named MAA (Modality-Agnostic Adapter).
We eliminate the modal differences in distribution and then use a modality-agnostic Transformer encoder for a semantic-level feature fusion.
Our experiments demonstrate that MAA achieves state-of-the-art results on benchmarks while using the same modalities as previous methods.
arXiv Detail & Related papers (2024-07-03T02:57:14Z)
- FusionBench: A Comprehensive Benchmark of Deep Model Fusion [78.80920533793595]
Deep model fusion is a technique that unifies the predictions or parameters of several deep neural networks into a single model.
FusionBench is the first comprehensive benchmark dedicated to deep model fusion.
arXiv Detail & Related papers (2024-06-05T13:54:28Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion [54.33764537135906]
VideoQA Transformer models demonstrate competitive performance on standard benchmarks.
Do these models capture the rich multimodal structures and dynamics from video and text jointly?
Are they achieving high scores by exploiting biases and spurious features?
arXiv Detail & Related papers (2023-06-15T06:45:46Z)
- A Self-Adjusting Fusion Representation Learning Model for Unaligned Text-Audio Sequences [16.38826799727453]
How to integrate relevant information of each modality to learn fusion representations has been one of the central challenges in multimodal learning.
In this paper, a Self-Adjusting Fusion Representation Learning Model is proposed to learn robust crossmodal fusion representations directly from the unaligned text and audio sequences.
Experiment results show that our model has significantly improved the performance of all the metrics on the unaligned text-audio sequences.
arXiv Detail & Related papers (2022-11-12T13:05:28Z)
- Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval [36.50847375135979]
Multi-modal learning from video data has seen increased attention recently as it allows training semantically meaningful embeddings without human annotation.
We present a multi-modal, modality fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them into a joined multi-modal representation.
arXiv Detail & Related papers (2021-12-08T18:14:57Z)
- ScaleVLAD: Improving Multimodal Sentiment Analysis via Multi-Scale Fusion of Locally Descriptors [15.042741192427334]
This paper proposes a fusion model named ScaleVLAD to gather multi-scale representations from text, video, and audio.
Experiments on three popular sentiment analysis benchmarks, IEMOCAP, MOSI, and MOSEI, demonstrate significant gains over baselines.
arXiv Detail & Related papers (2021-12-02T16:09:33Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Perceiver: General Perception with Iterative Attention [85.65927856589613]
We introduce the Perceiver, a model that builds upon Transformers.
We show that this architecture performs competitively or beyond strong, specialized models on classification tasks.
It also surpasses state-of-the-art results for all modalities in AudioSet.
arXiv Detail & Related papers (2021-03-04T18:20:50Z)
- Speech Prediction in Silent Videos using Variational Autoencoders [29.423462898526605]
We present a model for generating speech from a silent video.
The proposed model combines recurrent neural networks and variational deep generative models to learn the conditional distribution of the auditory signal.
We demonstrate the performance of our model on the GRID dataset based on standard benchmarks.
arXiv Detail & Related papers (2020-11-14T17:09:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.