MTAG: Modal-Temporal Attention Graph for Unaligned Human Multimodal
Language Sequences
- URL: http://arxiv.org/abs/2010.11985v2
- Date: Wed, 28 Apr 2021 18:44:01 GMT
- Title: MTAG: Modal-Temporal Attention Graph for Unaligned Human Multimodal
Language Sequences
- Authors: Jianing Yang, Yongxin Wang, Ruitao Yi, Yuying Zhu, Azaan Rehman, Amir
Zadeh, Soujanya Poria, Louis-Philippe Morency
- Abstract summary: MTAG is an interpretable graph-based neural model that provides a suitable framework for analyzing multimodal sequential data.
By learning to focus only on the important interactions within the graph, MTAG achieves state-of-the-art performance on multimodal sentiment analysis and emotion recognition benchmarks.
- Score: 46.146331814606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human communication is multimodal in nature; it is through multiple
modalities, such as language, voice, and facial expressions, that opinions and
emotions are expressed. Data in this domain exhibits complex multi-relational
and temporal interactions. Learning from this data is a fundamentally
challenging research problem. In this paper, we propose Modal-Temporal
Attention Graph (MTAG). MTAG is an interpretable graph-based neural model that
provides a suitable framework for analyzing multimodal sequential data. We
first introduce a procedure to convert unaligned multimodal sequence data into
a graph with heterogeneous nodes and edges that captures the rich interactions
across modalities and through time. Then, a novel graph fusion operation,
called MTAG fusion, along with a dynamic pruning and read-out technique, is
designed to efficiently process this modal-temporal graph and capture various
interactions. By learning to focus only on the important interactions within
the graph, MTAG achieves state-of-the-art performance on multimodal sentiment
analysis and emotion recognition benchmarks, while utilizing significantly
fewer model parameters.
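A minimal sketch of the first step described above, building a modal-temporal graph from unaligned sequences, is given below. The function name, the edge-type scheme (modality pair plus temporal direction), and the toy feature dimensions are illustrative assumptions, not the authors' implementation.

# Illustrative sketch (not the authors' code): turn unaligned per-modality
# sequences into a graph with typed nodes and typed edges.
from itertools import product

import numpy as np

def build_modal_temporal_graph(sequences):
    """sequences: dict mapping a modality name to an array of shape (T_m, d_m).

    Every node is one time step of one modality; every ordered node pair gets
    an edge typed by (source modality, target modality, temporal direction),
    so a fusion layer can attend differently per edge type.
    """
    feats, node_type, times = [], [], []
    for modality, seq in sequences.items():
        for t, x in enumerate(seq):
            feats.append(x)
            node_type.append(modality)
            times.append(t)  # index within its own (unaligned) sequence

    edges = []  # (src_index, dst_index, edge_type)
    for i, j in product(range(len(feats)), repeat=2):
        if i == j:
            continue
        if times[i] < times[j]:
            direction = "past->future"
        elif times[i] > times[j]:
            direction = "future->past"
        else:
            direction = "same-step"
        edges.append((i, j, (node_type[i], node_type[j], direction)))
    return feats, node_type, edges

# Toy unaligned input: different lengths and feature sizes per modality.
feats, types, edges = build_modal_temporal_graph({
    "language": np.random.randn(4, 300),
    "acoustic": np.random.randn(7, 74),
    "visual":   np.random.randn(5, 35),
})
print(len(feats), "nodes,", len(edges), "typed edges")

In the full model, the MTAG fusion operation would pass attention-weighted messages along these typed edges, dynamically prune the low-attention ones, and read out a graph-level representation for prediction.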
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields.
Previous work has ignored the inter-modal alignment process and intra-modal noise before multimodal fusion.
We have developed a novel approach called Masked Graph Learning with Recurrent Alignment (MGLRA) to tackle this problem.
arXiv Detail & Related papers (2024-07-23T02:23:51Z)
- TimeGraphs: Graph-based Temporal Reasoning [64.18083371645956]
TimeGraphs is a novel approach that characterizes dynamic interactions as a hierarchical temporal graph.
Our approach models the interactions using a compact graph-based representation, enabling adaptive reasoning across diverse time scales.
We evaluate TimeGraphs on multiple datasets with complex, dynamic agent interactions, including a football simulator, the Resistance game, and the MOMA human activity dataset.
arXiv Detail & Related papers (2024-01-06T06:26:49Z)
- Unified and Dynamic Graph for Temporal Character Grouping in Long Videos [31.192044026127032]
Video temporal character grouping locates the moments when major characters appear in a video, grouped according to their identities.
Recent works have evolved from unsupervised clustering to graph-based supervised clustering.
We present a unified and dynamic graph (UniDG) framework for temporal character grouping.
arXiv Detail & Related papers (2023-08-27T13:22:55Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- Multi-modal Multi-kernel Graph Learning for Autism Prediction and Biomarker Discovery [29.790200009136825]
We propose a novel method to offset the negative impact between modalities during multi-modal integration and to extract heterogeneous information from graphs.
Our method is evaluated on the benchmark Autism Brain Imaging Data Exchange (ABIDE) dataset and outperforms the state-of-the-art methods.
In addition, discriminative brain regions associated with autism are identified by our model, providing guidance for the study of autism pathology.
arXiv Detail & Related papers (2023-03-03T07:09:17Z)
- Analyzing Unaligned Multimodal Sequence via Graph Convolution and Graph Pooling Fusion [28.077474663199062]
We propose a novel model, termed Multimodal Graph, to investigate the effectiveness of graph neural networks (GNN) on modeling multimodal sequential data.
Our graph-based model reaches state-of-the-art performance on two benchmark datasets.
arXiv Detail & Related papers (2020-11-27T06:12:14Z)
- Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization [77.21951145754065]
We propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts this task as a process of iterative message passing over a joint graph.
Our CSMGAN is able to effectively capture high-order interactions between the two modalities, thus enabling more precise localization.
arXiv Detail & Related papers (2020-08-04T08:25:24Z)
- A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation [131.33610549540043]
We propose a novel graph-based multi-modal fusion encoder for NMT.
We first represent the input sentence and image using a unified multi-modal graph.
We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations.
arXiv Detail & Related papers (2020-07-17T04:06:09Z)
- Connecting the Dots: Multivariate Time Series Forecasting with Graph Neural Networks [91.65637773358347]
We propose a general graph neural network framework designed specifically for multivariate time series data.
Our approach automatically extracts uni-directed relations among variables through a graph learning module (see the sketch after this list).
Our proposed model outperforms the state-of-the-art baseline methods on 3 of 4 benchmark datasets.
arXiv Detail & Related papers (2020-05-24T04:02:18Z)
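The final entry above extracts directed relations among time-series variables with a graph learning module. The sketch below shows one common way such a module can be built, with learned per-variable embeddings, an antisymmetric pairwise score, and top-k sparsification; the class name, parameters, and exact formulation are assumptions rather than the paper's code.

# Generic sketch (assumed, not the paper's exact module): learn a sparse,
# directed adjacency matrix over the variables of a multivariate time series.
import torch
import torch.nn as nn

class GraphLearner(nn.Module):
    def __init__(self, num_nodes, dim, k=4, alpha=3.0):
        super().__init__()
        self.emb1 = nn.Embedding(num_nodes, dim)  # "source" role embedding
        self.emb2 = nn.Embedding(num_nodes, dim)  # "target" role embedding
        self.lin1 = nn.Linear(dim, dim)
        self.lin2 = nn.Linear(dim, dim)
        self.k, self.alpha = k, alpha

    def forward(self):
        idx = torch.arange(self.emb1.num_embeddings)
        m1 = torch.tanh(self.alpha * self.lin1(self.emb1(idx)))
        m2 = torch.tanh(self.alpha * self.lin2(self.emb2(idx)))
        # The antisymmetric score makes A[i, j] and A[j, i] compete, so the
        # learned relations are uni-directed rather than mutual.
        scores = m1 @ m2.t() - m2 @ m1.t()
        adj = torch.relu(torch.tanh(self.alpha * scores))
        # Keep only the k strongest outgoing edges per node.
        mask = torch.zeros_like(adj)
        mask.scatter_(1, adj.topk(self.k, dim=1).indices, 1.0)
        return adj * mask

adj = GraphLearner(num_nodes=10, dim=16)()
print(adj.shape)  # (10, 10); row i holds the weights of edges i -> j

A forecasting network would then propagate each variable's history over this learned graph to produce the multivariate predictions.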
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.