Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical
Fusion for Multimodal Affect Recognition
- URL: http://arxiv.org/abs/2305.13583v4
- Date: Mon, 13 Nov 2023 00:09:47 GMT
- Title: Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical
Fusion for Multimodal Affect Recognition
- Authors: Yaoting Wang, Yuanchao Li, Paul Pu Liang, Louis-Philippe Morency,
Peter Bell, Catherine Lai
- Abstract summary: Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
- Score: 69.32305810128994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fusing multiple modalities has proven effective for multimodal information
processing. However, the incongruity between modalities poses a challenge for
multimodal fusion, especially in affect recognition. In this study, we first
analyze how the salient affective information in one modality can be affected
by the other, and demonstrate that inter-modal incongruity exists latently in
crossmodal attention. Based on this finding, we propose the Hierarchical
Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight
incongruity-aware model, which dynamically chooses the primary modality in each
training batch and reduces fusion times by leveraging the learned hierarchy in
the latent space to alleviate incongruity. The experimental evaluation on five
benchmark datasets: CMU-MOSI, CMU-MOSEI, and IEMOCAP (sentiment and emotion),
where incongruity implicitly lies in hard samples, as well as UR-FUNNY (humour)
and MUStARD (sarcasm), where incongruity is common, verifies the efficacy of
our approach, showing that HCT-DMG: 1) outperforms previous multimodal models
with a reduced size of approximately 0.8M parameters; 2) recognizes hard
samples where incongruity makes affect recognition difficult; 3) mitigates the
incongruity at the latent level in crossmodal attention.
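
For readers who want a concrete picture of the two mechanisms named in the abstract, the sketch below shows plain crossmodal attention (one modality's sequence querying another) and a per-batch dynamic modality gate that selects a primary modality. It is a minimal illustration, not the authors' HCT-DMG implementation: the module names, feature dimensions, three-modality setup, and the mean-pooled scoring heuristic are all assumptions.

```python
import torch
import torch.nn as nn

class CrossmodalAttention(nn.Module):
    """One modality's sequence (query) attends to another modality's sequence (key/value)."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_seq: torch.Tensor, context_seq: torch.Tensor) -> torch.Tensor:
        # query_seq: (B, T_q, dim); context_seq: (B, T_c, dim)
        fused, _ = self.attn(query_seq, context_seq, context_seq)
        return self.norm(query_seq + fused)  # residual connection + layer norm

class DynamicModalityGate(nn.Module):
    """Scores each modality for the current batch and picks a primary one (assumed heuristic)."""
    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        self.scorers = nn.ModuleList([nn.Linear(dim, 1) for _ in range(num_modalities)])

    def forward(self, modality_feats: list[torch.Tensor]) -> int:
        # modality_feats: one (B, T_m, dim) tensor per modality
        scores = torch.stack([
            scorer(feats.mean(dim=1)).mean()   # mean-pool over time, average over the batch
            for scorer, feats in zip(self.scorers, modality_feats)
        ])
        return int(scores.argmax())            # index of the primary modality for this batch

# Usage: audio/text/vision features already projected to a shared dimension.
batch, dim = 8, 40
feats = [torch.randn(batch, t, dim) for t in (50, 30, 60)]  # e.g. audio, text, vision
primary = DynamicModalityGate(dim)(feats)
xattn = CrossmodalAttention(dim)
fused = feats[primary]
for i, f in enumerate(feats):
    if i != primary:
        fused = xattn(fused, f)  # primary modality queries each auxiliary modality in turn
print(fused.shape)  # (8, T_primary, 40)
```

The loop at the end is only a stand-in for the learned hierarchy described in the abstract; HCT-DMG additionally uses that hierarchy in the latent space to reduce the number of fusion steps and to alleviate incongruity.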
Related papers
- RADAR: Robust Two-stage Modality-incomplete Industrial Anomaly Detection [61.71770293720491]
We propose a novel two-stage Robust modAlity-incomplete fusing and Detecting frAmewoRk, abbreviated as RADAR.
Our bootstrapping philosophy is to enhance the two stages of modality-incomplete industrial anomaly detection (MIIAD), improving the robustness of the Multimodal Transformer.
Our experimental results demonstrate that the proposed RADAR significantly surpasses conventional MIAD methods in terms of effectiveness and robustness.
arXiv Detail & Related papers (2024-10-02T16:47:55Z)
- Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
- Missing-modality Enabled Multi-modal Fusion Architecture for Medical Data [8.472576865966744]
Fusing multi-modal data can improve the performance of deep learning models.
Missing modalities are common in medical data because the examinations collected vary from patient to patient.
This study develops an efficient multi-modal fusion architecture for medical data that is robust to missing modalities.
arXiv Detail & Related papers (2023-09-27T09:46:07Z)
- VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias [17.107961913114778]
Multimodal misinformation is a growing problem on social media platforms.
In this study, we investigate and identify the presence of unimodal bias in widely-used MMD benchmarks.
We introduce a new method -- termed Crossmodal HArd Synthetic MisAlignment (CHASMA) -- for generating realistic synthetic training data.
arXiv Detail & Related papers (2023-04-27T12:28:29Z)
- Robustness of Fusion-based Multimodal Classifiers to Cross-Modal Content Dilutions [27.983902791798965]
We develop a model that generates dilution text that maintains relevance and topical coherence with the image and existing text.
We find that the performance of task-specific fusion-based multimodal classifiers drops by 23.3% and 22.5% on the two evaluation tasks, respectively, in the presence of dilutions generated by our model.
Our work aims to highlight and encourage further research on the robustness of deep multimodal models to realistic variations.
arXiv Detail & Related papers (2022-11-04T17:58:02Z)
- Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN).
We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
The Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)