Cross-Modal Generalization: Learning in Low Resource Modalities via
Meta-Alignment
- URL: http://arxiv.org/abs/2012.02813v1
- Date: Fri, 4 Dec 2020 19:27:26 GMT
- Title: Cross-Modal Generalization: Learning in Low Resource Modalities via
Meta-Alignment
- Authors: Paul Pu Liang, Peter Wu, Liu Ziyin, Louis-Philippe Morency, Ruslan
Salakhutdinov
- Abstract summary: Cross-modal generalization is a learning paradigm to train a model that can quickly perform new tasks in a target modality.
We study a key research question: how can we ensure generalization across modalities despite using separate encoders for different source and target modalities?
Our solution is based on meta-alignment, a novel method to align representation spaces using strongly and weakly paired cross-modal data.
- Score: 99.29153138760417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The natural world is abundant with concepts expressed via visual, acoustic,
tactile, and linguistic modalities. Much of the existing progress in multimodal
learning, however, focuses primarily on problems where the same set of
modalities is present at train and test time, which makes learning in
low-resource modalities particularly difficult. In this work, we propose
algorithms for cross-modal generalization: a learning paradigm to train a model
that can (1) quickly perform new tasks in a target modality (i.e.,
meta-learning) and (2) do so while being trained on a different source
modality. We study a key research question: how can we ensure generalization
across modalities despite using separate encoders for different source and
target modalities? Our solution is based on meta-alignment, a novel method to
align representation spaces using strongly and weakly paired cross-modal data
while ensuring quick generalization to new tasks across different modalities.
We study this problem on 3 classification tasks: text to image, image to audio,
and text to speech. Our results demonstrate strong performance even when the
new target modality has only a few (1-10) labeled samples and in the presence
of noisy labels, a scenario particularly prevalent in low-resource modalities.
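
To make the two-stage idea above concrete, below is a minimal, illustrative sketch in PyTorch: separate encoders for a source (text) and target (image) modality are aligned with a contrastive objective on paired data, after which a classifier is fit on only a handful of labeled target-modality samples. The encoder architectures, feature dimensions, and InfoNCE-style loss are illustrative assumptions, not the paper's exact meta-alignment procedure.

```python
# Minimal sketch of cross-modal alignment with separate encoders, under the
# assumptions stated above (not the authors' exact method).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    """Maps raw features of one modality into a shared embedding space."""

    def __init__(self, in_dim: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-normalize so dot products act as cosine similarities.
        return F.normalize(self.net(x), dim=-1)


def alignment_loss(z_src: torch.Tensor, z_tgt: torch.Tensor, temperature: float = 0.1):
    """Symmetric InfoNCE-style loss: paired (source, target) embeddings are pulled
    together, mismatched pairs in the batch are pushed apart. This is an assumed
    stand-in for the paper's strongly/weakly paired alignment objective."""
    logits = z_src @ z_tgt.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(z_src.size(0))       # the i-th source pairs with the i-th target
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    # Hypothetical feature sizes: 300-d text features (source), 512-d image features (target).
    text_enc, image_enc = ModalityEncoder(300), ModalityEncoder(512)
    opt = torch.optim.Adam(
        list(text_enc.parameters()) + list(image_enc.parameters()), lr=1e-3
    )

    # Stage 1: align the two embedding spaces on (synthetic) paired data.
    for _ in range(100):
        text_batch, image_batch = torch.randn(32, 300), torch.randn(32, 512)
        loss = alignment_loss(text_enc(text_batch), image_enc(image_batch))
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: few-shot classification in the target modality. With aligned
    # spaces, a linear head fit on only a few labeled image samples per class
    # can reuse structure learned from the source (text) modality.
    few_shot_x = torch.randn(10, 512)
    few_shot_y = torch.randint(0, 2, (10,))
    with torch.no_grad():
        feats = image_enc(few_shot_x)
    head = nn.Linear(feats.size(1), 2)
    head_opt = torch.optim.SGD(head.parameters(), lr=0.1)
    for _ in range(50):
        head_loss = F.cross_entropy(head(feats), few_shot_y)
        head_opt.zero_grad()
        head_loss.backward()
        head_opt.step()
```

The paper's meta-alignment additionally wraps alignment in a meta-learning loop over tasks so that the aligned space generalizes quickly to new tasks in the target modality; this sketch omits that episodic structure for brevity.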
Related papers
- Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations [16.036997801745905]
Multimodal learning plays a crucial role in enabling machine learning models to fuse and utilize diverse data sources.
Recent binding methods, such as ImageBind, typically use a fixed anchor modality to align multimodal data in the anchor modal embedding space.
We propose CentroBind, a simple yet powerful approach that eliminates the need for a fixed anchor.
arXiv Detail & Related papers (2024-10-02T23:19:23Z)
- MuAP: Multi-step Adaptive Prompt Learning for Vision-Language Model with Missing Modality [11.03329286331929]
We present the first comprehensive investigation into prompt learning behavior when modalities are incomplete.
We propose a novel Multi-step Adaptive Prompt Learning framework, aiming to generate multimodal prompts and perform multi-step prompt tuning.
arXiv Detail & Related papers (2024-09-07T03:33:46Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning [112.51498431119616]
This paper studies efficient representation learning for high-modality scenarios involving a large set of diverse modalities.
A single model, HighMMT, scales up to 10 modalities (text, image, audio, video, sensors, proprioception, speech, time-series, sets, and tables) and 15 tasks from 5 research areas.
arXiv Detail & Related papers (2022-03-02T18:56:20Z)
- Distribution Alignment: A Unified Framework for Long-tail Visual Recognition [52.36728157779307]
We propose a unified distribution alignment strategy for long-tail visual recognition.
We then introduce a generalized re-weighting method in the two-stage learning framework to balance the class prior.
Our approach achieves the state-of-the-art results across all four recognition tasks with a simple and unified framework.
arXiv Detail & Related papers (2021-03-30T14:09:53Z)
- Multimodal Categorization of Crisis Events in Social Media [81.07061295887172]
We present a new multimodal fusion method that leverages both images and texts as input.
In particular, we introduce a cross-attention module that can filter uninformative and misleading components from weak modalities.
We show that our method outperforms the unimodal approaches and strong multimodal baselines by a large margin on three crisis-related tasks.
arXiv Detail & Related papers (2020-04-10T06:31:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.