Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably)
- URL: http://arxiv.org/abs/2203.12221v1
- Date: Wed, 23 Mar 2022 06:21:53 GMT
- Title: Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably)
- Authors: Yu Huang, Junyang Lin, Chang Zhou, Hongxia Yang, and Longbo Huang
- Abstract summary: It has been observed that the best uni-modal network outperforms the jointly trained multi-modal network.
This work provides a theoretical explanation for the emergence of such a performance gap in neural networks under the prevalent joint training framework.
- Score: 75.38159612828362
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the remarkable success of deep multi-modal learning in practice, it
has not been well-explained in theory. Recently, it has been observed that the
best uni-modal network outperforms the jointly trained multi-modal network,
which is counter-intuitive since multiple signals generally bring more
information. This work provides a theoretical explanation for the emergence of
such a performance gap in neural networks under the prevalent joint training
framework. Based on a simplified data distribution that captures realistic
properties of multi-modal data, we prove that for the multi-modal late-fusion
network with (smoothed) ReLU activation trained jointly by gradient descent,
different modalities will compete with each other. The encoder networks will
learn only a subset of modalities. We refer to this phenomenon as modality
competition. The losing modalities, whose features fail to be discovered, are
the origin of the sub-optimality of joint training. Experimentally,
we illustrate that modality competition matches the intrinsic behavior of
late-fusion joint training.
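To make the analyzed setting concrete, here is a minimal sketch of a late-fusion network of the kind the theory covers: one encoder per modality, fused only at the classification head, with all parameters trained jointly by gradient descent. This illustrates the architecture class, not the authors' code; the dimensions and the plain ReLU (standing in for the smoothed ReLU of the analysis) are assumptions.

```python
import torch
import torch.nn as nn

class LateFusionNet(nn.Module):
    """Two uni-modal encoders fused only at the final head -- the
    joint-training setup in which modality competition can arise."""
    def __init__(self, dim_a=32, dim_b=32, hidden=64, num_classes=2):
        super().__init__()
        # Plain ReLU stands in for the smoothed ReLU used in the analysis.
        self.enc_a = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU())
        self.enc_b = nn.Sequential(nn.Linear(dim_b, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, x_a, x_b):
        # Late fusion: modalities interact only through the final head.
        return self.head(torch.cat([self.enc_a(x_a), self.enc_b(x_b)], dim=-1))

# Joint training by plain gradient descent on synthetic data.
net = LateFusionNet()
opt = torch.optim.SGD(net.parameters(), lr=0.1)
x_a, x_b = torch.randn(8, 32), torch.randn(8, 32)
y = torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(net(x_a, x_b), y)
loss.backward()
opt.step()
```

In this setup, the paper's claim is that gradient descent drives only a subset of the encoders to learn their modality's features; the losing encoders remain undertrained.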
Related papers
- On the Comparison between Multi-modal and Single-modal Contrastive Learning [50.74988548106031]
We introduce a theoretical foundation for understanding the differences between multi-modal and single-modal contrastive learning.
We identify the signal-to-noise ratio (SNR) as the critical factor that governs downstream-task generalizability for both multi-modal and single-modal contrastive learning.
Our analysis provides a unified framework that can characterize the optimization and generalization of both single-modal and multi-modal contrastive learning.
arXiv Detail & Related papers (2024-11-05T06:21:17Z)
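For reference, the multi-modal side of that comparison is typically a symmetric cross-modal InfoNCE objective; the sketch below is a generic CLIP-style loss, offered as an illustration rather than the exact objective analyzed in the paper (the temperature value is an assumption).

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(z_a, z_b, tau=0.1):
    """Symmetric InfoNCE between paired embeddings of two modalities:
    the i-th rows of z_a and z_b form the positive pair."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / tau            # cosine similarity matrix
    targets = torch.arange(z_a.size(0))     # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```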
- Understanding Unimodal Bias in Multimodal Deep Linear Networks [7.197469507060226]
A key challenge is unimodal bias, where a network overly relies on one modality and ignores others during joint training.
We develop a theory of unimodal bias with multimodal deep linear networks to understand how architecture and data statistics influence this bias.
arXiv Detail & Related papers (2023-12-01T21:29:54Z)
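The object of study there, a multimodal deep linear network, fits in a few lines; the per-branch weight norms below are one crude, illustrative probe of unimodal bias, not necessarily the paper's own measure.

```python
import torch
import torch.nn as nn

class MultimodalDeepLinear(nn.Module):
    """Two deep *linear* branches fused by addition: with no
    nonlinearities, the training dynamics are analytically tractable."""
    def __init__(self, dim_a=16, dim_b=16, hidden=32, depth=2, out_dim=1):
        super().__init__()
        def branch(d_in):
            layers = [nn.Linear(d_in, hidden, bias=False)]
            layers += [nn.Linear(hidden, hidden, bias=False)
                       for _ in range(depth - 1)]
            return nn.Sequential(*layers)
        self.branch_a, self.branch_b = branch(dim_a), branch(dim_b)
        self.head = nn.Linear(hidden, out_dim, bias=False)

    def forward(self, x_a, x_b):
        return self.head(self.branch_a(x_a) + self.branch_b(x_b))

    def branch_norms(self):
        # If one branch's weights dwarf the other's, the network is
        # leaning mostly on that modality -- a rough bias indicator.
        na = sum(p.norm() for p in self.branch_a.parameters())
        nb = sum(p.norm() for p in self.branch_b.parameters())
        return na.item(), nb.item()
```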
- Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome the problem that some modalities dominate others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
arXiv Detail & Related papers (2023-11-17T18:57:40Z)
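A minimal sketch of that alternating scheme, as I read the abstract (component names and sizes are hypothetical; this is not the released MLA code): each step trains a single modality's encoder together with a head shared across all modalities.

```python
import torch
import torch.nn as nn

# Hypothetical setup: one encoder per modality plus one shared head.
encoders = nn.ModuleDict({
    "audio": nn.Sequential(nn.Linear(40, 64), nn.ReLU()),
    "video": nn.Sequential(nn.Linear(128, 64), nn.ReLU()),
})
shared_head = nn.Linear(64, 10)  # continuously optimized across modalities
params = list(encoders.parameters()) + list(shared_head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def training_step(step, batches):
    """Alternating unimodal learning: step t updates only the encoder of
    modality t mod K (plus the shared head); there is no joint pass."""
    name = list(encoders)[step % len(encoders)]
    x, y = batches[name]
    loss = nn.functional.cross_entropy(shared_head(encoders[name](x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```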
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
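The module described can be sketched as one projection per modality into a shared space, followed by a pooling step that works for any subset of modalities; the modality names and dimensions below are placeholders.

```python
import torch
import torch.nn as nn

class CommonSpaceFusion(nn.Module):
    """Project each modality's features into one shared space, then mean-
    pool, so arbitrary (even previously unseen) combinations can be fused."""
    def __init__(self, dims, common_dim=64):
        super().__init__()
        self.proj = nn.ModuleDict(
            {name: nn.Linear(d, common_dim) for name, d in dims.items()})

    def forward(self, feats):
        # feats: dict mapping modality name -> (batch, dim) tensor;
        # mean pooling is invariant to which subset is present.
        zs = [self.proj[name](x) for name, x in feats.items()]
        return torch.stack(zs).mean(dim=0)

fusion = CommonSpaceFusion({"rgb": 512, "depth": 256, "audio": 128})
out = fusion({"rgb": torch.randn(4, 512), "audio": torch.randn(4, 128)})
```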
- MMANet: Margin-aware Distillation and Modality-aware Regularization for Incomplete Multimodal Learning [4.647741695828225]
MMANet is a framework for assisting incomplete multimodal learning.
It consists of three components: the deployment network used for inference, the teacher network transferring comprehensive multimodal information, and the regularization network guiding the deployment network to balance weak modality combinations.
arXiv Detail & Related papers (2023-04-17T07:22:15Z)
- Routing with Self-Attention for Multimodal Capsule Networks [108.85007719132618]
We present a new multimodal capsule network that allows us to leverage the strength of capsules in the context of a multimodal learning framework.
To adapt the capsules to large-scale input data, we propose a novel routing by self-attention mechanism that selects relevant capsules.
This allows not only robust training with noisy video data, but also scaling the capsule network up in size relative to traditional routing methods.
arXiv Detail & Related papers (2021-12-01T19:01:26Z)
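Routing by self-attention can be prototyped as a single scaled dot-product attention pass in which higher-level capsules attend over lower-level capsule poses; the single-head form and all shapes below are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SelfAttentionRouting(nn.Module):
    """Replace iterative routing-by-agreement with one attention pass:
    learned higher-level capsule queries attend over lower-level poses."""
    def __init__(self, in_caps_dim=16, out_caps=8, out_caps_dim=32):
        super().__init__()
        self.query = nn.Parameter(torch.randn(out_caps, out_caps_dim))
        self.key = nn.Linear(in_caps_dim, out_caps_dim)
        self.value = nn.Linear(in_caps_dim, out_caps_dim)

    def forward(self, caps):                 # caps: (batch, n_in, in_caps_dim)
        k, v = self.key(caps), self.value(caps)
        scores = self.query @ k.transpose(1, 2) / k.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)  # which input capsules to route
        return attn @ v                       # (batch, out_caps, out_caps_dim)
```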
- What Makes Multimodal Learning Better than Single (Provably) [28.793128982222438]
We show that learning with multiple modalities achieves a smaller population risk than using only a subset of the modalities.
This is the first theoretical treatment to capture important qualitative phenomena observed in real multimodal applications.
arXiv Detail & Related papers (2021-06-08T17:20:02Z)
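Schematically, that result compares the population risk of a predictor trained on a modality set with one trained on a subset; the notation below is my own shorthand, and the paper makes the conditions and residual terms precise.

```latex
% Schematic statement; notation illustrative, conditions omitted.
% r(\cdot) denotes population risk, \hat{g}_{S} the predictor learned
% from the modality set S.
\mathcal{N} \subseteq \mathcal{M}
\quad\Longrightarrow\quad
r\bigl(\hat{g}_{\mathcal{M}}\bigr) \;\le\; r\bigl(\hat{g}_{\mathcal{N}}\bigr)
```

Read together with the headline paper above, the two results are arguably complementary: this inequality bounds what multi-modal learning can achieve, while modality competition explains why jointly trained networks may fail to realize it.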
- Optimizing Neural Networks via Koopman Operator Theory [6.09170287691728]
Koopman operator theory was recently shown to be intimately connected with neural network theory.
In this work we take the first steps in making use of this connection.
We show that Koopman operator theoretic methods allow predictions of the weights and biases of feed-forward networks over a non-trivial range of training time.
arXiv Detail & Related papers (2020-06-03T16:23:07Z)
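The prediction idea can be prototyped with plain dynamic mode decomposition: collect flattened weight-and-bias snapshots from early training, fit a linear operator, and roll it forward. This is a generic DMD sketch of that idea, not the paper's exact estimator.

```python
import numpy as np

def fit_koopman(snapshots):
    """snapshots: (T, d) array of flattened weights+biases at steps 0..T-1.
    Least-squares fit of K with w_{t+1} ~= K @ w_t (linear DMD)."""
    X, Y = snapshots[:-1].T, snapshots[1:].T   # (d, T-1) each
    return Y @ np.linalg.pinv(X)               # K: (d, d)

def predict(K, w0, steps):
    """Roll the fitted operator forward from weight vector w0."""
    out, w = [], w0
    for _ in range(steps):
        w = K @ w
        out.append(w)
    return np.stack(out)                       # (steps, d) predicted weights
```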
- Unpaired Multi-modal Segmentation via Knowledge Distillation [77.39798870702174]
We propose a novel learning scheme for unpaired cross-modality image segmentation.
In our method, we heavily reuse network parameters by sharing all convolutional kernels across CT and MRI.
We have extensively validated our approach on two multi-class segmentation problems.
arXiv Detail & Related papers (2020-01-06T20:03:17Z)
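The parameter-sharing idea is straightforward to express: every convolution kernel is shared between CT and MRI, while small modality-specific layers absorb the distribution shift. Using per-modality batch normalization for that role is my assumption here, not a detail taken from the abstract.

```python
import torch
import torch.nn as nn

class SharedKernelBlock(nn.Module):
    """One conv block whose kernels are reused for both CT and MRI;
    only the normalization statistics are modality-specific (assumed)."""
    def __init__(self, cin, cout):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, 3, padding=1)  # shared across modalities
        self.bn = nn.ModuleDict({"ct": nn.BatchNorm2d(cout),
                                 "mri": nn.BatchNorm2d(cout)})

    def forward(self, x, modality):
        return torch.relu(self.bn[modality](self.conv(x)))

block = SharedKernelBlock(1, 16)
ct_out = block(torch.randn(2, 1, 64, 64), "ct")
mri_out = block(torch.randn(2, 1, 64, 64), "mri")  # same kernels, own stats
```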