Multimodal Routing: Improving Local and Global Interpretability of
Multimodal Language Analysis
- URL: http://arxiv.org/abs/2004.14198v2
- Date: Mon, 5 Oct 2020 04:56:42 GMT
- Title: Multimodal Routing: Improving Local and Global Interpretability of
Multimodal Language Analysis
- Authors: Yao-Hung Hubert Tsai, Martin Q. Ma, Muqiao Yang, Ruslan Salakhutdinov,
and Louis-Philippe Morency
- Abstract summary: Recent multimodal learning methods with strong performance on human-centric tasks are often black-box.
We propose Multimodal Routing, which adjusts weights between input modalities and output representations differently for each input sample.
- Score: 103.69656907534456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human language can be expressed through multiple sources of
information known as modalities, including tones of voice, facial gestures, and
spoken language. Recent multimodal learning methods with strong performance on
human-centric tasks such as sentiment analysis and emotion recognition are
often black-box, with very limited interpretability. In this paper we propose
Multimodal Routing, which dynamically adjusts weights between input modalities
and output representations differently for each input sample. Multimodal
Routing can identify the relative importance of both individual modalities and
cross-modality features. Moreover, the weight assignment by routing allows us
to interpret modality-prediction relationships not only globally (i.e. general
trends over the whole dataset) but also locally for each single input sample,
while keeping performance competitive with state-of-the-art methods.
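As a rough illustration of the mechanism described in the abstract, here is a minimal sketch (not the authors' released code) of per-sample routing between unimodal/bimodal features and output concept representations. The agreement-style iterative update, the function name `route`, the number of iterations, and the toy dimensions are illustrative assumptions; the paper's exact routing update and normalization may differ.

```python
# Minimal per-sample routing sketch (illustrative, not the paper's code).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def route(features, proj, num_iters=3):
    """features: (F, d_in) unimodal/bimodal features of ONE sample.
    proj:     (F, C, d_in, d_out) learned feature-to-concept projections.
    Returns concept vectors (C, d_out) and routing weights r (F, C)."""
    votes = np.einsum('fi,fcio->fco', features, proj)   # each feature's vote per concept
    logits = np.zeros(votes.shape[:2])                   # (F, C) routing logits
    for _ in range(num_iters):
        r = softmax(logits, axis=1)                      # per-sample routing weights
        concepts = np.einsum('fc,fco->co', r, votes)     # concept vectors from weighted votes
        # Strengthen routes whose votes agree with the resulting concept.
        logits = logits + np.einsum('fco,co->fc', votes, concepts)
    r = softmax(logits, axis=1)
    return np.einsum('fc,fco->co', r, votes), r

# Toy usage: 3 unimodal + 3 bimodal features, 2 output concepts.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 8))
proj = 0.1 * rng.normal(size=(6, 2, 8, 4))
concepts, r = route(feats, proj)
print(r)  # importance of each feature for each concept, for this sample
```

For a single sample, `r` plays the role of the local interpretation; averaging it over a dataset would correspond to the global, dataset-level trends the abstract refers to.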
Related papers
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Interpretable Tensor Fusion [26.314148163750257]
We introduce interpretable tensor fusion (InTense), a method for training neural networks to simultaneously learn multimodal data representations.
InTense provides interpretability out of the box by assigning relevance scores to modalities and their associations.
Experiments on six real-world datasets show that InTense outperforms existing state-of-the-art multimodal interpretable approaches in terms of accuracy and interpretability.
arXiv Detail & Related papers (2024-05-07T21:05:50Z)
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling [115.89786751297348]
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities.
We build a multimodal text-centric dataset for multimodal alignment pre-training.
We show that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities.
arXiv Detail & Related papers (2024-02-19T15:33:10Z)
- Generalized Product-of-Experts for Learning Multimodal Representations in Noisy Environments [18.14974353615421]
We propose a novel method for multimodal representation learning in a noisy environment via the generalized product-of-experts technique (see the fusion sketch after this list).
In the proposed method, we train a separate network for each modality to assess the credibility of information coming from that modality.
We attain state-of-the-art performance on two challenging benchmarks: multimodal 3D hand-pose estimation and multimodal surgical video segmentation.
arXiv Detail & Related papers (2022-11-07T14:27:38Z)
- Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z)
- Unsupervised Multimodal Language Representations using Convolutional Autoencoders [5.464072883537924]
We propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks.
We map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets.
It is also shown that our method is extremely lightweight and generalizes easily to other tasks and unseen data, with a small performance drop and almost the same number of parameters.
arXiv Detail & Related papers (2021-10-06T18:28:07Z)
- Uncertainty-Aware Balancing for Multilingual and Multi-Domain Neural Machine Translation Training [58.72619374790418]
MultiUAT dynamically adjusts the training data usage based on the model's uncertainty.
We analyze the cross-domain transfer and show the deficiency of static and similarity-based methods.
arXiv Detail & Related papers (2021-09-06T08:30:33Z)
- Cross-Modal Generalization: Learning in Low Resource Modalities via Meta-Alignment [99.29153138760417]
Cross-modal generalization is a learning paradigm to train a model that can quickly perform new tasks in a target modality.
We study a key research question: how can we ensure generalization across modalities despite using separate encoders for different source and target modalities?
Our solution is based on meta-alignment, a novel method to align representation spaces using strongly and weakly paired cross-modal data.
arXiv Detail & Related papers (2020-12-04T19:27:26Z)
- Robust Latent Representations via Cross-Modal Translation and Alignment [36.67937514793215]
Most multi-modal machine learning methods require that all the modalities used for training are also available for testing.
To address this limitation, we aim to improve the testing performance of uni-modal systems using multiple modalities during training only.
The proposed multi-modal training framework uses cross-modal translation and correlation-based latent space alignment.
arXiv Detail & Related papers (2020-11-03T11:18:04Z)
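The Generalized Product-of-Experts entry above fuses per-modality estimates according to each modality's credibility. Below is a minimal sketch of that fusion rule for diagonal Gaussian experts, under the assumption that credibility acts as a tempering weight on each expert; the per-modality credibility networks that would produce `alphas`, and the name `gpoe_fuse`, are illustrative and not the paper's implementation.

```python
# Minimal generalized product-of-experts fusion for diagonal Gaussian experts.
import numpy as np

def gpoe_fuse(means, variances, alphas):
    """means, variances: (M, d) per-modality Gaussian estimates.
    alphas: (M,) non-negative credibility weights (tempering exponents).
    Returns the mean and variance of the weighted product of the M Gaussians."""
    precisions = alphas[:, None] / variances        # weighting an expert scales its precision
    fused_precision = precisions.sum(axis=0)
    fused_mean = (precisions * means).sum(axis=0) / fused_precision
    return fused_mean, 1.0 / fused_precision

# Toy usage: the noisy, low-credibility modality pulls the fused estimate less.
means = np.array([[1.0, 0.0], [3.0, 2.0]])
variances = np.array([[0.1, 0.1], [2.0, 2.0]])
alphas = np.array([1.0, 0.3])
print(gpoe_fuse(means, variances, alphas))
```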
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.