Routing with Self-Attention for Multimodal Capsule Networks
- URL: http://arxiv.org/abs/2112.00775v1
- Date: Wed, 1 Dec 2021 19:01:26 GMT
- Title: Routing with Self-Attention for Multimodal Capsule Networks
- Authors: Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel
Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah
- Abstract summary: We present a new multimodal capsule network that allows us to leverage the strength of capsules in the context of a multimodal learning framework.
To adapt the capsules to large-scale input data, we propose a novel routing by self-attention mechanism that selects relevant capsules.
This allows not only for robust training with noisy video data, but also for scaling up the capsule network compared to traditional routing methods.
- Score: 108.85007719132618
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of multimodal learning has seen a growing interest recently as it
allows for training neural architectures based on different modalities such as
vision, text, and audio. One challenge in training such models is that they
need to jointly learn semantic concepts and their relationships across
different input representations. Capsule networks have been shown to perform
well in the context of capturing the relation between low-level input features
and higher-level concepts. However, capsules have so far mainly been used in
small-scale, fully supervised settings due to the resource demands of
conventional routing algorithms. We present a new multimodal capsule network
that allows us to leverage the strength of capsules in the context of a
multimodal learning framework on large amounts of video data. To adapt the
capsules to large-scale input data, we propose a novel routing by
self-attention mechanism that selects relevant capsules which are then used to
generate a final joint multimodal feature representation. This allows not only
for robust training with noisy video data, but also for scaling up the capsule
network compared to traditional routing methods while remaining
computationally efficient. We evaluate the proposed architecture by pretraining
it on a large-scale multimodal video dataset and applying it on four datasets
in two challenging downstream tasks. Results show that the proposed multimodal
capsule network not only improves over other routing techniques, but also
achieves competitive performance on the task of multimodal learning.
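The core idea of routing by self-attention, as the abstract describes it, is that attention scores select the relevant capsules, whose outputs are then pooled into a joint multimodal representation. Below is a minimal pure-Python sketch of that idea under simplifying assumptions (capsules as plain vectors, a single query vector, dot-product scoring); it is illustrative only, not the paper's implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_route(capsules, query):
    """Pool capsule vectors into one joint vector, weighting each
    capsule by its attention score against a query vector."""
    scores = softmax([sum(c_i * q_i for c_i, q_i in zip(c, query))
                      for c in capsules])
    dim = len(capsules[0])
    return [sum(w * c[i] for w, c in zip(scores, capsules))
            for i in range(dim)]
```

Capsules that align with the query dominate the pooled representation, while irrelevant capsules are effectively suppressed, which is the selection behavior the routing mechanism relies on.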
Related papers
- Multi-modal Semantic Understanding with Contrastive Cross-modal Feature
Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment.
Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
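Contrastive cross-modal alignment of this kind typically pulls matched pairs together and pushes mismatched pairs apart. A minimal InfoNCE-style loss for one anchor, sketched in pure Python under the assumption of precomputed similarity scores (not this paper's exact objective):

```python
import math

def info_nce_loss(similarities, positive_index, temperature=0.07):
    """Contrastive (InfoNCE-style) loss for one anchor: the positive
    pair's similarity should dominate all negatives in the row."""
    logits = [s / temperature for s in similarities]
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[positive_index]
```

When the positive similarity clearly exceeds the negatives, the loss approaches zero; when all similarities are equal, it equals log of the number of candidates.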
arXiv Detail & Related papers (2024-03-11T01:07:36Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
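The "freeze more than 99% of parameters" claim can be sanity-checked with back-of-envelope arithmetic: one linear projection plus one trainable token is tiny next to a frozen backbone. The sizes below are hypothetical, chosen only to illustrate the scale, not the paper's exact numbers.

```python
# Hypothetical model sizes -- illustrative only, not eP-ALM's exact figures.
frozen_params = 350_000_000          # frozen language model + visual encoder
d_visual, d_text = 768, 1024         # assumed embedding widths

# The only trainable pieces: one linear projection (bias-free assumed)
# mapping visual features to the text space, plus one soft prompt token.
trainable_params = d_visual * d_text + d_text

trainable_fraction = trainable_params / (frozen_params + trainable_params)
# Under these assumptions this is well below 1% of total parameters.
```

Even with a much larger projection, the trainable share stays far under one percent, which is what makes this style of adaptation cheap.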
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably) [75.38159612828362]
It has been observed that the best uni-modal network outperforms the jointly trained multi-modal network.
This work provides a theoretical explanation for the emergence of such performance gap in neural networks for the prevalent joint training framework.
arXiv Detail & Related papers (2022-03-23T06:21:53Z)
- Bandit Sampling for Multiplex Networks [8.771092194928674]
We propose an algorithm for scalable learning on multiplex networks with a large number of layers.
An online learning algorithm learns to sample relevant neighboring layers so that only layers with relevant information are aggregated during training.
We present experimental results on both synthetic and real-world scenarios.
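Bandit-style layer sampling of this kind is commonly realized with an explore/exploit rule over per-layer reward estimates. A minimal epsilon-greedy sketch, under the assumption that a scalar reward per sampled layer is available (the function names are illustrative, not from the paper):

```python
import random

def pick_layer(reward_estimates, epsilon, rng=random):
    """Epsilon-greedy bandit choice over multiplex-network layers:
    explore a random layer with probability epsilon, otherwise
    exploit the layer with the highest estimated reward."""
    if rng.random() < epsilon:
        return rng.randrange(len(reward_estimates))
    return max(range(len(reward_estimates)),
               key=lambda i: reward_estimates[i])

def update_estimate(estimates, counts, layer, reward):
    """Incremental mean update of a layer's reward estimate."""
    counts[layer] += 1
    estimates[layer] += (reward - estimates[layer]) / counts[layer]
```

Over training, layers whose information repeatedly improves the objective accumulate higher estimates and are sampled more often, so aggregation concentrates on informative layers.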
arXiv Detail & Related papers (2022-02-08T03:26:34Z)
- Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos [69.61522804742427]
This paper proposes a self-supervised training framework that learns a common multimodal embedding space.
We extend the concept of instance-level contrastive learning with a multimodal clustering step to capture semantic similarities across modalities.
The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains.
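The clustering step described above groups embeddings from different modalities around shared centroids, so samples of the same clip can be matched across modalities via their cluster assignment. A minimal nearest-centroid sketch, assuming embeddings already live in a common space (hypothetical helper, not the paper's code):

```python
def nearest_centroid(vec, centroids):
    """Assign an embedding to its closest centroid (squared Euclidean),
    returning the cluster index."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)),
               key=lambda k: sqdist(vec, centroids[k]))
```

If a video embedding and its paired text embedding land in the same cluster, cross-modal retrieval can operate on cluster ids rather than raw instance similarities, which is what gives the clustering step its semantic grouping effect.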
arXiv Detail & Related papers (2021-04-26T15:55:01Z)
- Training Deep Capsule Networks with Residual Connections [0.0]
Capsule networks are a type of neural network that have recently gained increased popularity.
They consist of groups of neurons, called capsules, which encode properties of objects or object parts.
Most capsule network implementations use two to three capsule layers, which limits their applicability as expressivity grows exponentially with depth.
We propose a methodology to train deeper capsule networks using residual connections, which is evaluated on four datasets and three different routing algorithms.
Our experimental results show that in fact, performance increases when training deeper capsule networks.
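The residual idea for capsule layers mirrors standard residual networks: each output capsule is the transformed capsule plus its input, which keeps gradients flowing through deep stacks. A minimal sketch with capsules as plain vectors and an arbitrary per-capsule transform (illustrative, not the paper's architecture):

```python
def residual_capsule_layer(capsules, transform):
    """Apply a per-capsule transformation with a residual (skip)
    connection: output capsule = transform(capsule) + capsule."""
    return [[t + x for t, x in zip(transform(c), c)] for c in capsules]
```

With a zero transform the layer reduces to the identity, which is the property that lets deeper stacks start close to a shallow network and learn refinements on top.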
arXiv Detail & Related papers (2021-04-15T11:42:44Z)
- Multimodal Knowledge Expansion [14.332957885505547]
We propose a knowledge distillation-based framework to utilize multimodal data without requiring labels.
We show that a multimodal student model consistently denoises pseudo labels and generalizes better than its teacher.
arXiv Detail & Related papers (2021-03-26T12:32:07Z)
- Unpaired Multi-modal Segmentation via Knowledge Distillation [77.39798870702174]
We propose a novel learning scheme for unpaired cross-modality image segmentation.
In our method, we heavily reuse network parameters, by sharing all convolutional kernels across CT and MRI.
We have extensively validated our approach on two multi-class segmentation problems.
arXiv Detail & Related papers (2020-01-06T20:03:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.