Learning Unseen Modality Interaction
- URL: http://arxiv.org/abs/2306.12795v3
- Date: Wed, 25 Oct 2023 09:11:50 GMT
- Title: Learning Unseen Modality Interaction
- Authors: Yunhua Zhang and Hazel Doughty and Cees G.M. Snoek
- Abstract summary: Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
- Score: 54.23533023883659
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal learning assumes all modality combinations of interest are
available during training to learn cross-modal correspondences. In this paper,
we challenge this modality-complete assumption for multimodal learning and
instead strive for generalization to unseen modality combinations during
inference. We pose the problem of unseen modality interaction and introduce a
first solution. It exploits a module that projects the multidimensional
features of different modalities into a common space with rich information
preserved. This allows the information to be accumulated with a simple
summation operation across available modalities. To reduce overfitting to less
discriminative modality combinations during training, we further improve the
model learning with pseudo-supervision indicating the reliability of a
modality's prediction. We demonstrate that our approach is effective for
diverse tasks and modalities by evaluating it for multimodal video
classification, robot state regression, and multimedia retrieval. Project
website: https://xiaobai1217.github.io/Unseen-Modality-Interaction/.
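To make the fusion idea concrete, below is a minimal, hypothetical PyTorch sketch of the two ingredients the abstract describes: per-modality projection into a common space followed by simple summation over whichever modalities are available, plus a per-modality confidence score that could act as the pseudo-supervision of reliability. All module names, dimensions, and the confidence proxy are assumptions, not the authors' implementation.

```python
# Minimal sketch (dims, names, and reliability proxy are assumptions), not the authors' code.
import torch
import torch.nn as nn

class UnseenModalityFusion(nn.Module):
    def __init__(self, input_dims: dict, shared_dim: int = 256, num_classes: int = 10):
        super().__init__()
        # One projection per modality into a common feature space.
        self.projectors = nn.ModuleDict(
            {name: nn.Linear(d, shared_dim) for name, d in input_dims.items()}
        )
        self.classifier = nn.Linear(shared_dim, num_classes)

    def forward(self, features: dict):
        # Project each available modality, then accumulate by simple summation,
        # so any subset of modalities can be fused at inference time.
        projected = {m: self.projectors[m](x) for m, x in features.items()}
        fused = torch.stack(list(projected.values()), dim=0).sum(dim=0)
        logits = self.classifier(fused)
        # Per-modality max softmax confidence, a possible pseudo-label of reliability.
        reliability = {
            m: self.classifier(z).softmax(dim=-1).max(dim=-1).values
            for m, z in projected.items()
        }
        return logits, reliability

model = UnseenModalityFusion({"video": 1024, "audio": 128})
logits, rel = model({"audio": torch.randn(4, 128)})  # unseen combination: audio only
```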
Related papers
- Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models [6.610033827647869]
In real-world scenarios, consistently acquiring complete multimodal data presents significant challenges.
This often leads to the issue of missing modalities, where data for certain modalities are absent.
We propose a novel framework integrating parameter-efficient fine-tuning of unimodal pretrained models with a self-supervised joint-embedding learning method.
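A hypothetical sketch of the general recipe described here: unimodal pretrained encoders stay frozen, small adapters are trained in a parameter-efficient way, and a self-supervised joint-embedding loss pulls paired embeddings together. The adapter design, dimensions, and loss are assumptions, not the paper's method.

```python
# Hypothetical sketch: frozen unimodal encoders + small trainable adapters,
# aligned with a self-supervised joint-embedding objective. Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Parameter-efficient head on top of a frozen pretrained encoder."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, dim))

    def forward(self, x):
        return x + self.net(x)  # residual adapter

def joint_embedding_loss(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    # Pull paired embeddings from two modalities together in a shared space.
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    return (1 - (z_a * z_b).sum(dim=-1)).mean()

# Usage: the pretrained encoders are frozen; only the adapters receive gradients.
frozen_image_feats = torch.randn(8, 512)  # stand-in for a frozen image encoder's output
frozen_text_feats = torch.randn(8, 512)   # stand-in for a frozen text encoder's output
adapter_img, adapter_txt = Adapter(512), Adapter(512)
loss = joint_embedding_loss(adapter_img(frozen_image_feats), adapter_txt(frozen_text_feats))
loss.backward()
```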
arXiv Detail & Related papers (2024-07-17T14:44:25Z)
- Diagnosing and Re-learning for Balanced Multimodal Learning [8.779005254634857]
We propose the Diagnosing & Re-learning method to overcome the imbalanced multimodal learning problem.
The learning state of each modality is estimated based on the separability of its uni-modal representation space.
In this way, the over-emphasizing of scarcely informative modalities is avoided.
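One way to estimate such a learning state is a separability score on each modality's representations. The sketch below uses a ratio of between-class to within-class scatter; this particular criterion is an assumption and may differ from the paper's.

```python
# Hypothetical separability score for one modality's representation space.
import torch

def separability(features: torch.Tensor, labels: torch.Tensor) -> float:
    classes = labels.unique()
    overall_mean = features.mean(dim=0)
    between, within = 0.0, 0.0
    for c in classes:
        feats_c = features[labels == c]
        mean_c = feats_c.mean(dim=0)
        between += len(feats_c) * (mean_c - overall_mean).pow(2).sum()
        within += (feats_c - mean_c).pow(2).sum()
    return (between / (within + 1e-8)).item()

# A modality with a low score would be diagnosed as under-trained or scarcely
# informative, and could be re-learned or down-weighted accordingly.
scores = {"audio": separability(torch.randn(64, 32), torch.randint(0, 5, (64,))),
          "video": separability(torch.randn(64, 32), torch.randint(0, 5, (64,)))}
```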
arXiv Detail & Related papers (2024-07-12T22:12:03Z)
- ReconBoost: Boosting Can Achieve Modality Reconcilement [89.4377895465204]
We study the modality-alternating learning paradigm to achieve reconcilement.
We propose a new method, called ReconBoost, which updates one fixed modality at a time.
We show that the proposed method resembles Friedman's Gradient-Boosting (GB) algorithm, where the updated learner can correct errors made by others.
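A hypothetical sketch of the alternating, boosting-like update pattern: on each step only one modality's learner is trained, against the ensemble prediction in which the other learners are frozen, so it can correct their errors. This is a simplification, not ReconBoost's actual objective.

```python
# Simplified modality-alternating, boosting-style training loop (an assumption).
import torch
import torch.nn as nn
import torch.nn.functional as F

learners = nn.ModuleDict({"audio": nn.Linear(128, 10), "video": nn.Linear(512, 10)})
opts = {m: torch.optim.SGD(l.parameters(), lr=0.1) for m, l in learners.items()}

def train_step(batch: dict, labels: torch.Tensor, active: str):
    # Ensemble prediction: sum of logits, with gradients only through `active`.
    logits = 0
    for m, learner in learners.items():
        out = learner(batch[m])
        logits = logits + (out if m == active else out.detach())
    loss = F.cross_entropy(logits, labels)
    opts[active].zero_grad()
    loss.backward()
    opts[active].step()
    return loss.item()

batch = {"audio": torch.randn(4, 128), "video": torch.randn(4, 512)}
labels = torch.randint(0, 10, (4,))
for step in range(4):                      # alternate which modality is updated
    train_step(batch, labels, ["audio", "video"][step % 2])
```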
arXiv Detail & Related papers (2024-05-15T13:22:39Z)
- Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome the challenge that some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
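A hypothetical sketch of the alternating-unimodal idea with a shared head: each training step uses a single modality's encoder together with one classification head that is optimized across all modalities. Encoder sizes and the training schedule are assumptions, not the MLA authors' code.

```python
# Simplified alternating unimodal adaptation with a shared head (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

encoders = nn.ModuleDict({"audio": nn.Linear(128, 256), "video": nn.Linear(512, 256)})
shared_head = nn.Linear(256, 10)  # continuously optimized across all modalities

optimizer = torch.optim.Adam(
    list(encoders.parameters()) + list(shared_head.parameters()), lr=1e-3
)

def unimodal_step(modality: str, x: torch.Tensor, y: torch.Tensor):
    # Only this modality's encoder (plus the shared head) receives gradients here.
    logits = shared_head(encoders[modality](x))
    loss = F.cross_entropy(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Alternate modalities instead of jointly fusing them during training.
for step in range(4):
    m = ["audio", "video"][step % 2]
    x = torch.randn(8, 128 if m == "audio" else 512)
    unimodal_step(m, x, torch.randint(0, 10, (8,)))
```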
arXiv Detail & Related papers (2023-11-17T18:57:40Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
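A hypothetical sketch of an early-fusion, single-stream encoder: per-modality inputs are embedded and combined before a single shared backbone, rather than running one stream per modality. The skeleton-style modality names, dimensions, and summation fusion are assumptions, not UmURL's exact design.

```python
# Simplified early-fusion, single-stream encoder (assumed dimensions and fusion).
import torch
import torch.nn as nn

class SingleStreamEncoder(nn.Module):
    def __init__(self, modality_dims: dict, hidden: int = 256):
        super().__init__()
        self.embed = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in modality_dims.items()})
        self.backbone = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

    def forward(self, inputs: dict) -> torch.Tensor:
        # Early fusion: combine per-modality embeddings, then run one shared backbone.
        fused = torch.stack([self.embed[m](x) for m, x in inputs.items()], dim=0).sum(dim=0)
        return self.backbone(fused)

enc = SingleStreamEncoder({"joint": 75, "motion": 75, "bone": 75})
z = enc({"joint": torch.randn(2, 75), "motion": torch.randn(2, 75), "bone": torch.randn(2, 75)})
```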
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- On Uni-Modal Feature Learning in Supervised Multi-Modal Learning [21.822251958013737]
We abstract the features (i.e. learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions.
We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets.
arXiv Detail & Related papers (2023-05-02T07:15:10Z)
- Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal Sentiment Analysis [18.4364234071951]
We propose a novel framework HyCon for hybrid contrastive learning of tri-modal representation.
Specifically, we simultaneously perform intra-/inter-modal contrastive learning and semi-contrastive learning.
Our proposed method outperforms existing works.
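A hypothetical sketch of how one InfoNCE-style loss can be reused for both intra-modal contrast (two augmented views of the same modality) and inter-modal contrast (paired modalities). This is simplified; HyCon's exact formulation, including its semi-contrastive term, is not reproduced here.

```python
# Simplified intra-/inter-modal contrastive losses (an assumed InfoNCE variant).
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature   # similarity of every pair in the batch
    targets = torch.arange(z_a.size(0))    # matching indices are the positives
    return F.cross_entropy(logits, targets)

text, audio, video = (torch.randn(16, 128) for _ in range(3))
text_aug = text + 0.01 * torch.randn_like(text)  # stand-in for an augmented view
loss = info_nce(text, text_aug) + info_nce(text, audio) + info_nce(text, video)
```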
arXiv Detail & Related papers (2021-09-04T06:04:21Z)
- Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data.
Under our proposed framework, the generative model can accurately identify related samples from unrelated ones, making it possible to make use of the plentiful unlabeled, unpaired multimodal data.
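A hypothetical sketch of the "related vs. unrelated" contrast: true multimodal pairings should score higher than shuffled, unrelated ones. A cosine relatedness score and a margin loss stand in for the paper's generative-model formulation.

```python
# Simplified contrast between related (true) and unrelated (shuffled) pairings.
import torch
import torch.nn.functional as F

def relatedness(z_img: torch.Tensor, z_txt: torch.Tensor) -> torch.Tensor:
    return (F.normalize(z_img, dim=-1) * F.normalize(z_txt, dim=-1)).sum(dim=-1)

z_img, z_txt = torch.randn(32, 64), torch.randn(32, 64)
related = relatedness(z_img, z_txt)                        # true pairings
unrelated = relatedness(z_img, z_txt[torch.randperm(32)])  # shuffled, likely unrelated
# Margin loss: related pairs should score higher than unrelated ones.
loss = F.relu(0.5 - related + unrelated).mean()
```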
arXiv Detail & Related papers (2020-07-02T15:08:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.