High-Modality Multimodal Transformer: Quantifying Modality & Interaction
Heterogeneity for High-Modality Representation Learning
- URL: http://arxiv.org/abs/2203.01311v4
- Date: Wed, 28 Jun 2023 17:58:11 GMT
- Title: High-Modality Multimodal Transformer: Quantifying Modality & Interaction
Heterogeneity for High-Modality Representation Learning
- Authors: Paul Pu Liang, Yiwei Lyu, Xiang Fan, Jeffrey Tsaw, Yudong Liu,
Shentong Mo, Dani Yogatama, Louis-Philippe Morency, Ruslan Salakhutdinov
- Abstract summary: This paper studies efficient representation learning for high-modality scenarios involving a large set of diverse modalities.
A single model, HighMMT, scales up to 10 modalities (text, image, audio, video, sensors, proprioception, speech, time-series, sets, and tables) and 15 tasks from 5 research areas.
- Score: 112.51498431119616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many real-world problems are inherently multimodal, from spoken language,
gestures, and paralinguistics humans use to communicate, to force,
proprioception, and visual sensors on robots. While there has been an explosion
of interest in multimodal learning, existing methods focus on a small set of
modalities, primarily language, vision, and audio. In order to accelerate
generalization towards diverse and understudied modalities, this paper studies
efficient representation learning for high-modality scenarios involving a large
set of diverse modalities. Since adding new models for every new modality
becomes prohibitively expensive, a critical technical challenge is
heterogeneity quantification: how can we measure which modalities encode
similar information and interactions in order to permit parameter sharing with
previous modalities? This paper proposes two new information theoretic metrics
for heterogeneity quantification: (1) modality heterogeneity studies how
similar two modalities {X1,X2} are by measuring how much information can be
transferred from X1 to X2, while (2) interaction heterogeneity studies how
similarly pairs of modalities {X1,X2}, {X3,X4} interact by measuring how much
information can be transferred from fusing {X1,X2} to {X3,X4}. We show the
importance of these two proposed metrics as a way to automatically prioritize the
fusion of modalities that contain unique information or interactions. The
result is a single model, HighMMT, that scales up to 10 modalities (text,
image, audio, video, sensors, proprioception, speech, time-series, sets, and
tables) and 15 tasks from 5 research areas. Not only does HighMMT outperform
prior methods on the tradeoff between performance and efficiency, it also
demonstrates a crucial scaling behavior: performance continues to improve with
each modality added, and it transfers to entirely new modalities and tasks
during fine-tuning.
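
The transfer-based intuition behind the modality heterogeneity metric can be illustrated with a small, self-contained sketch (not the authors' implementation): train an encoder on one modality, freeze it, and measure how well it supports a new head on another modality relative to a target-only baseline. The synthetic data, the shared feature dimension, and the `heterogeneity` function below are illustrative assumptions; the paper defines its metrics information-theoretically and applies them at much larger scale.

```python
# Minimal sketch (assumptions, not the paper's code): approximate modality
# heterogeneity between two modalities by how poorly an encoder trained on
# the source modality transfers, frozen, to the target modality.
import torch
import torch.nn as nn

def fit(model, x, y, steps=300, lr=1e-2):
    """Train only the parameters that require grad; return final training loss."""
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

def heterogeneity(x_src, y_src, x_tgt, y_tgt, hidden=32, n_classes=2):
    """Hypothetical transfer-gap score: larger gap = less shared information."""
    d = x_src.shape[1]
    assert x_tgt.shape[1] == d, "sketch assumes both modalities share a feature dim"
    # 1) train encoder + head on the source modality
    enc = nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
    fit(nn.Sequential(enc, nn.Linear(hidden, n_classes)), x_src, y_src)
    # 2) freeze the encoder and train only a new head on the target modality
    for p in enc.parameters():
        p.requires_grad_(False)
    transfer_loss = fit(nn.Sequential(enc, nn.Linear(hidden, n_classes)), x_tgt, y_tgt)
    # 3) target-only baseline trained from scratch
    base = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, n_classes))
    baseline_loss = fit(base, x_tgt, y_tgt)
    return transfer_loss - baseline_loss

# Toy usage with two synthetic "modalities" that share one label.
torch.manual_seed(0)
n, d = 512, 16
y = torch.randint(0, 2, (n,))
x_text = torch.randn(n, d) + y.unsqueeze(1).float()         # strongly label-informative
x_audio = torch.randn(n, d) + 0.5 * y.unsqueeze(1).float()  # partially redundant with x_text
print("transfer gap (text -> audio):", heterogeneity(x_text, y, x_audio, y))
```

The gap is only a coarse proxy; in the paper, pairwise modality heterogeneity (and, analogously, interaction heterogeneity over fused pairs) is what drives which modalities share parameters and which fusions are prioritized in HighMMT.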
Related papers
- Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy? [12.662031101992968]
We investigate the effects of multiple modalities on recognition accuracy on both synthetic and real-world datasets.
Images as a supplementary modality for speech recognition provide the greatest benefit at moderate noise levels.
Performance improves on both synthetic and real-world datasets when the most relevant visual information is filtered as a preprocessing step.
arXiv Detail & Related papers (2024-09-13T22:18:45Z)
- Multimodal Graph Learning for Generative Tasks [89.44810441463652]
Multimodal learning combines multiple data modalities, broadening the types and complexity of data our models can utilize.
We propose Multimodal Graph Learning (MMGL), a framework for capturing information from multiple multimodal neighbors with relational structures among them.
arXiv Detail & Related papers (2023-10-11T13:25:03Z)
- Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation [9.817888267356716]
Multimodal Emotion Recognition in Conversation (ERC) faces two problems.
Deep emotion cues are extracted from modalities with strong representation ability.
Feature filters are designed as multimodal prompt information for modalities with weak representation ability.
MPT embeds multimodal fusion information into each attention layer of the Transformer.
arXiv Detail & Related papers (2023-10-04T13:54:46Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- Learning Multimodal Data Augmentation in Feature Space [65.54623807628536]
LeMDA is an easy-to-use method that automatically learns to jointly augment multimodal data in feature space.
We show that LeMDA can profoundly improve the performance of multimodal deep learning architectures.
arXiv Detail & Related papers (2022-12-29T20:39:36Z)
- Video Sentiment Analysis with Bimodal Information-augmented Multi-Head Attention [7.997124140597719]
This study focuses on the sentiment analysis of videos containing time series data of multiple modalities.
The key problem is how to fuse these heterogeneous data.
Based on bimodal interaction, more important bimodal features are assigned larger weights.
arXiv Detail & Related papers (2021-03-03T12:30:11Z)
- Cross-Modal Generalization: Learning in Low Resource Modalities via Meta-Alignment [99.29153138760417]
Cross-modal generalization is a learning paradigm to train a model that can quickly perform new tasks in a target modality.
We study a key research question: how can we ensure generalization across modalities despite using separate encoders for different source and target modalities?
Our solution is based on meta-alignment, a novel method to align representation spaces using strongly and weakly paired cross-modal data.
arXiv Detail & Related papers (2020-12-04T19:27:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.