An Efficient End-to-End Transformer with Progressive Tri-modal Attention
for Multi-modal Emotion Recognition
- URL: http://arxiv.org/abs/2209.09768v1
- Date: Tue, 20 Sep 2022 14:51:38 GMT
- Title: An Efficient End-to-End Transformer with Progressive Tri-modal Attention
for Multi-modal Emotion Recognition
- Authors: Yang Wu, Pai Peng, Zhenyu Zhang, Yanyan Zhao, Bing Qin
- Abstract summary: We propose the multi-modal end-to-end transformer (ME2ET), which can effectively model the tri-modal feature interactions.
At the low level, we propose the progressive tri-modal attention, which models the tri-modal feature interactions by adopting a two-pass strategy.
At the high level, we introduce the tri-modal feature fusion layer to explicitly aggregate the semantic representations of the three modalities.
- Score: 27.96711773593048
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works on multi-modal emotion recognition move towards end-to-end
models, which, compared with the two-phase pipeline, can extract task-specific
features supervised by the target task. However, previous methods only model the
feature interactions between the textual modality and either the acoustic or the
visual modality, failing to capture the interactions between the acoustic and
visual modalities. In this paper, we propose the multi-modal end-to-end
transformer (ME2ET), which can effectively model the tri-modal feature
interactions among the textual, acoustic, and visual modalities at both the low
level and the high level. At the low level, we propose the progressive tri-modal
attention, which models the tri-modal feature interactions through a two-pass
strategy and further leverages these interactions to significantly reduce the
computation and memory complexity by shortening the input token length. At the
high level, we introduce the tri-modal feature fusion layer to explicitly
aggregate the semantic representations of the three modalities. Experimental
results on the CMU-MOSEI and IEMOCAP datasets show that ME2ET achieves
state-of-the-art performance. Further in-depth analysis demonstrates the
effectiveness, efficiency, and interpretability of the proposed progressive
tri-modal attention, which helps our model achieve better performance while
significantly reducing the computation and memory cost. Our code will be
publicly available.
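The abstract names the two components without giving their exact formulation, so the following is a minimal PyTorch sketch of the two ideas it describes: a two-pass, cross-modally guided token-reduction step and an explicit tri-modal aggregation layer. The module names, shapes, pooling choices, and the top-k selection heuristic are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class ProgressiveTriModalAttentionSketch(nn.Module):
    """Rough sketch of a two-pass token-reduction step (assumed, not ME2ET's exact layer).

    Pass 1: score each token of one modality against a pooled summary of the other
    two modalities. Pass 2: keep only the top-k highest-scoring tokens, so later
    transformer layers attend over a much shorter sequence.
    """

    def __init__(self, dim: int, keep_tokens: int):
        super().__init__()
        self.keep_tokens = keep_tokens
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # tokens:  (B, N, D) tokens of one modality (e.g. acoustic or visual)
        # context: (B, M, D) concatenated tokens of the other two modalities
        ctx = context.mean(dim=1, keepdim=True)                   # (B, 1, D) pooled cross-modal summary
        scores = (self.query(tokens) * self.key(ctx)).sum(-1)     # (B, N) relevance of each token
        k = min(self.keep_tokens, tokens.size(1))
        idx = scores.topk(k, dim=1).indices                       # (B, k) most relevant token indices
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))   # (B, k, D) for gather
        return tokens.gather(1, idx)                              # (B, k, D) reduced token sequence


class TriModalFusionSketch(nn.Module):
    """Rough sketch of an explicit fusion head over pooled tri-modal representations."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(3 * dim, num_classes)

    def forward(self, text, audio, vision):
        # Each input: (B, N_m, D); mean-pool per modality, concatenate, classify.
        pooled = torch.cat([x.mean(dim=1) for x in (text, audio, vision)], dim=-1)
        return self.classifier(pooled)
```

The point of the sketch is the complexity argument the abstract makes: after the reduction step, downstream attention runs over k tokens per modality instead of the full sequence length, which is where the claimed computation and memory savings would come from.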
Related papers
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
- AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z)
- X Modality Assisting RGBT Object Tracking [36.614908357546035]
We propose a novel X Modality Assisting Network (X-Net) to shed light on the impact of the fusion paradigm.
To tackle the feature learning hurdles stemming from significant differences between RGB and thermal modalities, a plug-and-play pixel-level generation module (PGM) is proposed.
We also propose a feature-level interaction module (FIM) that incorporates a mixed feature interaction transformer and a spatial-dimensional feature translation strategy.
arXiv Detail & Related papers (2023-12-27T05:38:54Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Mutual Information-driven Triple Interaction Network for Efficient Image Dehazing [54.168567276280505]
We propose a novel Mutual Information-driven Triple interaction Network (MITNet) for image dehazing.
The first stage, named amplitude-guided haze removal, aims to recover the amplitude spectrum of the hazy images for haze removal.
The second stage, named phase-guided structure refinement, is devoted to learning the transformation and refinement of the phase spectrum.
arXiv Detail & Related papers (2023-08-14T08:23:58Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- UNIMO-3: Multi-granularity Interaction for Vision-Language Representation Learning [35.88753097105914]
We propose the UNIMO-3 model, which has the capacity to simultaneously learn the multimodal in-layer interaction and cross-layer interaction.
Our model achieves state-of-the-art performance on various downstream tasks, and an ablation study shows that effective cross-layer learning improves the model's multimodal representation ability.
arXiv Detail & Related papers (2023-05-23T05:11:34Z)
- EffMulti: Efficiently Modeling Complex Multimodal Interactions for Emotion Analysis [8.941102352671198]
We design three kinds of latent representations to refine the emotion analysis process.
A modality-semantic hierarchical fusion is proposed to reasonably incorporate these representations into a comprehensive interaction representation.
The experimental results demonstrate that our EffMulti outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2022-12-16T03:05:55Z)
- LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences [5.570499497432848]
We propose an efficient neural network to learn modality-fused representations with CB-Transformer (LMR-CBT) for multimodal emotion recognition.
We conduct word-aligned and unaligned experiments on three challenging datasets.
arXiv Detail & Related papers (2021-12-03T03:43:18Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)