MMOE: Mixture of Multimodal Interaction Experts
- URL: http://arxiv.org/abs/2311.09580v1
- Date: Thu, 16 Nov 2023 05:31:21 GMT
- Title: MMOE: Mixture of Multimodal Interaction Experts
- Authors: Haofei Yu, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency
- Abstract summary: MMOE stands for a mixture of multimodal interaction experts.
Our method automatically classifies data points from unlabeled multimodal datasets by their interaction type and employs specialized models for each specific interaction.
Based on our experiments, this approach improves performance on these challenging interactions by more than 10%, leading to an overall increase of 2% for tasks like sarcasm prediction.
- Score: 115.20477067767399
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal machine learning, which studies the information and interactions
across various input modalities, has made significant advancements in
understanding the relationship between images and descriptive text. However,
this covers only a portion of the multimodal interactions seen in the real
world and leaves out, for example, the interactions between conflicting
utterances and gestures that arise when predicting sarcasm. Notably, current
methods for capturing shared information often do not extend well to these more
nuanced interactions, sometimes performing as low as 50% (chance level) in
binary classification. In
this paper, we address this problem via a new approach called MMOE, which
stands for a mixture of multimodal interaction experts. Our method
automatically classifies data points from unlabeled multimodal datasets by
their interaction type and employs specialized models for each specific
interaction. Based on our experiments, this approach improves performance on
these challenging interactions by more than 10%, leading to an overall increase
of 2% for tasks like sarcasm prediction. As a result, interaction
quantification provides new insights for dataset analysis and yields simple
approaches that obtain state-of-the-art performance.
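To make the routing idea concrete, here is a minimal sketch (not the authors' implementation) of how a mixture of multimodal interaction experts could be wired together: a lightweight classifier assigns each example soft weights over interaction types, and a specialized expert handles each type. The interaction categories, module names, and the soft-weighting scheme are illustrative assumptions.
```python
# Minimal, illustrative sketch of a "mixture of multimodal interaction experts".
# NOT the authors' code: the interaction categories, expert architectures,
# and soft-routing scheme are assumptions for illustration only.
import torch
import torch.nn as nn

INTERACTION_TYPES = ["redundancy", "uniqueness", "synergy"]  # assumed categories

class InteractionClassifier(nn.Module):
    """Scores how strongly each interaction type is present in a sample."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(2 * dim, len(INTERACTION_TYPES))

    def forward(self, text_feat, image_feat):
        # Concatenate modality features and produce soft interaction weights.
        joint = torch.cat([text_feat, image_feat], dim=-1)
        return torch.softmax(self.scorer(joint), dim=-1)

class FusionExpert(nn.Module):
    """One specialized predictor per interaction type (identical shape here for brevity)."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_classes)
        )

    def forward(self, text_feat, image_feat):
        return self.net(torch.cat([text_feat, image_feat], dim=-1))

class MixtureOfInteractionExperts(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 2):
        super().__init__()
        self.router = InteractionClassifier(dim)
        self.experts = nn.ModuleList(
            [FusionExpert(dim, num_classes) for _ in INTERACTION_TYPES]
        )

    def forward(self, text_feat, image_feat):
        weights = self.router(text_feat, image_feat)          # (batch, num_experts)
        logits = torch.stack(
            [expert(text_feat, image_feat) for expert in self.experts], dim=1
        )                                                      # (batch, num_experts, num_classes)
        # Weight each expert's prediction by the inferred interaction type.
        return (weights.unsqueeze(-1) * logits).sum(dim=1)

# Usage: route a batch of pre-extracted text/image features through the mixture.
model = MixtureOfInteractionExperts()
text_feat, image_feat = torch.randn(4, 256), torch.randn(4, 256)
predictions = model(text_feat, image_feat)  # (4, 2) logits, e.g. sarcastic vs. not
```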
Related papers
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted with our AIMDiT framework on the public benchmark dataset MELD show improvements of 2.34% and 2.87% in the Acc-7 and w-F1 metrics, respectively.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
- Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding [7.329728566839757]
We propose Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF), a novel multi-modal soft prompt framework based on the unified vision-language model (VLM).
arXiv Detail & Related papers (2024-03-17T19:12:26Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even showing emergent ability to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition [81.2011058113579]
We argue that both the feature multimodality and conversational contextualization should be properly modeled simultaneously during the feature disentanglement and fusion steps.
We propose a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism (CRM) for multimodal and context integration.
Our system achieves new state-of-the-art performance consistently.
arXiv Detail & Related papers (2023-08-08T18:11:27Z)
- Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input [27.102030262319197]
We present Switch-BERT for joint vision and language representation learning to address the problem of modality mismatch.
Switch-BERT extends BERT architecture by introducing learnable layer-wise and cross-layer interactions.
Results confirm that, whereas alternative architectures including ViLBERT and UNITER may excel in particular tasks, Switch-BERT consistently achieves better or comparable performance.
arXiv Detail & Related papers (2023-06-25T09:28:40Z)
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
- Robustness of Fusion-based Multimodal Classifiers to Cross-Modal Content Dilutions [27.983902791798965]
We develop a model that generates dilution text that maintains relevance and topical coherence with the image and existing text.
We find that the performance of task-specific fusion-based multimodal classifiers drops by 23.3% and 22.5% in the presence of dilutions generated by our model.
Our work aims to highlight and encourage further research on the robustness of deep multimodal models to realistic variations.
arXiv Detail & Related papers (2022-11-04T17:58:02Z)
- Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z)
- M2Lens: Visualizing and Explaining Multimodal Models for Sentiment Analysis [28.958168542624062]
We present an interactive visual analytics system, M2Lens, to visualize and explain multimodal models for sentiment analysis.
M2Lens provides explanations on intra- and inter-modal interactions at the global, subset, and local levels.
arXiv Detail & Related papers (2021-07-17T15:54:27Z)