Single-branch Network for Multimodal Training
- URL: http://arxiv.org/abs/2303.06129v1
- Date: Fri, 10 Mar 2023 18:48:40 GMT
- Title: Single-branch Network for Multimodal Training
- Authors: Muhammad Saad Saeed, Shah Nawaz, Muhammad Haris Khan, Muhammad Zaigham
Zaheer, Karthik Nandakumar, Muhammad Haroon Yousaf, Arif Mahmood
- Abstract summary: We propose a novel single-branch network capable of learning discriminative representations for unimodal as well as multimodal tasks without changing the network.
We evaluate our proposed single-branch network on the challenging multimodal problem of face-voice association for cross-modal verification and matching tasks with various loss formulations.
- Score: 19.690844799632327
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid growth of social media platforms, users are sharing billions
of multimedia posts containing audio, images, and text. Researchers have
focused on building autonomous systems capable of processing such multimedia
data to solve challenging multimodal tasks including cross-modal retrieval,
matching, and verification. Existing works use separate networks to extract
embeddings of each modality to bridge the gap between them. The modular
structure of their branched networks is fundamental in creating numerous
multimodal applications and has become a de facto standard for handling multiple
modalities. In contrast, we propose a novel single-branch network capable of
learning discriminative representations for unimodal as well as multimodal tasks
without changing the network. An important feature of our single-branch network
is that it can be trained either using single or multiple modalities without
sacrificing performance. We evaluate our proposed single-branch network on the
challenging multimodal problem (face-voice association) for cross-modal
verification and matching tasks with various loss formulations. Experimental
results demonstrate the superiority of our proposed single-branch network over
the existing methods in a wide range of experiments. Code:
https://github.com/msaadsaeed/SBNet
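The repository above holds the authors' implementation; the following is only a rough, hypothetical sketch of the core idea, a single branch shared across modalities behind thin modality-specific projections. All dimensions, layer sizes, and the identity count are placeholder values, not the paper's settings:

```python
import torch
import torch.nn as nn

class SingleBranchNet(nn.Module):
    """Minimal sketch: one shared branch consumes embeddings from any
    modality. Modality-specific linear projections map inputs to a
    common dimensionality; the discriminative capacity lives in the
    shared branch, so it can be trained uni- or multimodally."""

    def __init__(self, face_dim=4096, voice_dim=512, common_dim=256, num_ids=901):
        super().__init__()
        # Placeholder dims: e.g. 4096-d face and 512-d voice features.
        self.project = nn.ModuleDict({
            "face": nn.Linear(face_dim, common_dim),
            "voice": nn.Linear(voice_dim, common_dim),
        })
        # The single shared branch, used unchanged for every modality.
        self.branch = nn.Sequential(
            nn.Linear(common_dim, common_dim),
            nn.ReLU(),
            nn.Linear(common_dim, common_dim),
        )
        self.classifier = nn.Linear(common_dim, num_ids)  # identity logits

    def forward(self, x, modality):
        z = self.branch(self.project[modality](x))
        return z, self.classifier(z)

net = SingleBranchNet()
face_emb, _ = net(torch.randn(8, 4096), "face")    # unimodal batch
voice_emb, _ = net(torch.randn(8, 512), "voice")   # same branch, other modality
```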
Related papers
- Multi-modal Semantic Understanding with Contrastive Cross-modal Feature
Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment.
Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
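As a hedged illustration of contrastive cross-modal feature alignment (a symmetric CLIP-style InfoNCE loss, not necessarily this paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_feats, txt_feats, temperature=0.07):
    """CLIP-style symmetric contrastive loss: matched image/text pairs
    (row i with row i) are pulled together; every other pairing in the
    batch serves as a negative."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_alignment_loss(torch.randn(16, 256), torch.randn(16, 256))
```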
arXiv Detail & Related papers (2024-03-11T01:07:36Z)
- OmniVec: Learning robust representations with cross modal sharing [28.023214572340336]
We present an approach to learn multiple tasks, in multiple modalities, with a unified architecture.
The proposed network is composed of task-specific encoders, a common trunk in the middle, followed by task-specific prediction heads.
We train the network on all major modalities, e.g., visual, audio, text, and 3D, and report results on 22 diverse and challenging public benchmarks.
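A minimal sketch of the described layout, with per-modality encoders feeding a common trunk and per-task heads on top, is shown below; every module size here is a placeholder and the real encoders are far deeper:

```python
import torch
import torch.nn as nn

class SharedTrunkNet(nn.Module):
    """Rough sketch of an OmniVec-style layout: per-modality encoders
    feed a common trunk, and per-task heads read the trunk's output."""

    def __init__(self, dim=256):
        super().__init__()
        self.encoders = nn.ModuleDict({
            "visual": nn.Linear(1024, dim),   # placeholder encoders;
            "audio": nn.Linear(128, dim),     # real ones are deep networks
            "text": nn.Linear(768, dim),
        })
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.heads = nn.ModuleDict({
            "classify": nn.Linear(dim, 1000),
            "retrieve": nn.Linear(dim, dim),
        })

    def forward(self, x, modality, task):
        h = self.trunk(self.encoders[modality](x))
        return self.heads[task](h.mean(dim=1))  # pool tokens, then head

net = SharedTrunkNet()
logits = net(torch.randn(2, 10, 768), modality="text", task="classify")
```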
arXiv Detail & Related papers (2023-11-07T14:00:09Z)
- Robust Multimodal Learning with Missing Modalities via Parameter-Efficient Adaptation [16.17270247327955]
We propose a simple and parameter-efficient adaptation procedure for pretrained multimodal networks.
We demonstrate that such adaptation can partially bridge the performance drop due to missing modalities.
Our proposed method demonstrates versatility across various tasks and datasets, and outperforms existing methods for robust multimodal learning with missing modalities.
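One common realization of parameter-efficient adaptation is a small bottleneck adapter trained while the pretrained backbone stays frozen; the sketch below illustrates that general recipe under assumed dimensions, not the paper's specific procedure:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: the only trainable parameters when adapting
    a frozen pretrained network to a missing-modality condition."""
    def __init__(self, dim, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual update

# Usage sketch: freeze the pretrained backbone, train only the adapter
# that is switched in when, e.g., the audio stream is unavailable.
backbone = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
for p in backbone.parameters():
    p.requires_grad = False          # backbone stays fixed
adapter = Adapter(512)               # few trainable parameters

x = torch.randn(4, 512)              # features with a modality zeroed out
out = adapter(backbone(x))
```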
arXiv Detail & Related papers (2023-10-06T03:04:21Z)
- Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
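A generic building block behind such multimodal transformers is cross-modal attention over unaligned sequences; a minimal sketch (not the MCMulT architecture itself, and with placeholder dimensions) is:

```python
import torch
import torch.nn as nn

# Cross-modal attention: text tokens query audio tokens. Unaligned
# sequence lengths are fine; attention handles the misalignment.
attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
text = torch.randn(2, 20, 128)
audio = torch.randn(2, 50, 128)
fused, _ = attn(query=text, key=audio, value=audio)  # (2, 20, 128)
```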
arXiv Detail & Related papers (2022-06-16T07:47:57Z)
- Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably) [75.38159612828362]
It has been observed that the best uni-modal network outperforms the jointly trained multi-modal network.
This work provides a theoretical explanation for the emergence of such a performance gap in neural networks under the prevalent joint training framework.
arXiv Detail & Related papers (2022-03-23T06:21:53Z)
- Channel Exchanging Networks for Multimodal and Multitask Dense Image Prediction [125.18248926508045]
We propose Channel-Exchanging-Network (CEN) which is self-adaptive, parameter-free, and more importantly, applicable for both multimodal fusion and multitask learning.
CEN dynamically exchanges channels between sub-networks of different modalities.
For the application of dense image prediction, the validity of CEN is tested in four different scenarios.
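A simplified two-modality sketch of the channel-exchanging idea, assuming channels are judged uninformative when their batch-norm scaling factor falls below a threshold, is shown here (not the paper's exact code):

```python
import torch

def exchange_channels(feat_a, feat_b, gamma_a, gamma_b, threshold=1e-2):
    """Where a modality's BN scaling factor gamma falls below the
    threshold, that channel is treated as uninformative and replaced
    by the other modality's channel."""
    mask_a = (gamma_a.abs() < threshold).view(1, -1, 1, 1)  # dead channels in A
    mask_b = (gamma_b.abs() < threshold).view(1, -1, 1, 1)
    out_a = torch.where(mask_a, feat_b, feat_a)  # A borrows from B
    out_b = torch.where(mask_b, feat_a, feat_b)  # B borrows from A
    return out_a, out_b

rgb = torch.randn(2, 64, 32, 32)    # e.g. RGB branch features
depth = torch.randn(2, 64, 32, 32)  # e.g. depth branch features
g_rgb, g_depth = torch.rand(64), torch.rand(64)
rgb, depth = exchange_channels(rgb, depth, g_rgb, g_depth, threshold=0.1)
```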
arXiv Detail & Related papers (2021-12-04T05:47:54Z)
- Routing with Self-Attention for Multimodal Capsule Networks [108.85007719132618]
We present a new multimodal capsule network that allows us to leverage the strength of capsules in the context of a multimodal learning framework.
To adapt the capsules to large-scale input data, we propose a novel routing by self-attention mechanism that selects relevant capsules.
This allows not only robust training with noisy video data, but also scaling up the size of the capsule network compared to traditional routing methods.
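A minimal sketch of routing by self-attention, where pairwise capsule agreement replaces iterative dynamic-routing coefficients, might look like this (illustrative only, not the paper's exact mechanism):

```python
import torch
import torch.nn.functional as F

def self_attention_routing(capsules, temperature=1.0):
    """capsules: (batch, num_capsules, dim). Dot-product agreement
    between capsules yields routing weights in a single pass, in place
    of iterative dynamic routing."""
    scores = capsules @ capsules.transpose(1, 2)          # pairwise agreement
    scores = scores / (capsules.size(-1) ** 0.5 * temperature)
    weights = F.softmax(scores, dim=-1)                   # routing coefficients
    return weights @ capsules                             # routed capsules

routed = self_attention_routing(torch.randn(2, 16, 8))   # (2, 16, 8)
```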
arXiv Detail & Related papers (2021-12-01T19:01:26Z)
- Multi-Task Learning with Sequence-Conditioned Transporter Networks [67.57293592529517]
We aim to solve multi-task learning through the lens of sequence-conditioning and weighted sampling.
First, we propose a new benchmark suite aimed at compositional tasks, MultiRavens, which allows defining custom task combinations.
Second, we propose a vision-based end-to-end system architecture, Sequence-Conditioned Transporter Networks, which augments Goal-Conditioned Transporter Networks with sequence-conditioning and weighted sampling.
arXiv Detail & Related papers (2021-09-15T21:19:11Z)
- Deep Multimodal Neural Architecture Search [178.35131768344246]
We devise a generalized deep multimodal neural architecture search (MMnas) framework for various multimodal learning tasks.
Given multimodal input, we first define a set of primitive operations, and then construct a deep encoder-decoder based unified backbone.
On top of the unified backbone, we attach task-specific heads to tackle different multimodal learning tasks.
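The search mechanism in such one-shot NAS frameworks can be illustrated with a mixed operation whose architecture weights softly combine candidate primitives; the sketch below shows that generic mechanism, not MMnas itself:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Generic one-shot NAS cell: a softmax over architecture weights
    mixes a set of primitive operations; after search, the argmax op
    would be kept."""

    def __init__(self, dim=256):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),                                  # skip connection
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()),  # feed-forward
            nn.MultiheadAttention(dim, 4, batch_first=True),
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # arch weights

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        outs = [self.ops[0](x), self.ops[1](x),
                self.ops[2](x, x, x, need_weights=False)[0]]
        return sum(wi * o for wi, o in zip(w, outs))

y = MixedOp()(torch.randn(2, 10, 256))  # (2, 10, 256)
```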
arXiv Detail & Related papers (2020-04-25T07:00:32Z)