MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding
- URL: http://arxiv.org/abs/2507.04635v1
- Date: Mon, 07 Jul 2025 03:37:42 GMT
- Title: MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding
- Authors: Zhicheng Zhang, Wuyou Xia, Chenxi Zhao, Zhou Yan, Xiaoqiang Liu, Yongjie Zhu, Wenyu Qin, Pengfei Wan, Di Zhang, Jufeng Yang
- Abstract summary: Multimodal large language models (MLLMs) recently showed strong capacity in integrating data among multiple modalities. Modular Duplex Attention (MODA) simultaneously conducts inner-modal refinement and inter-modal interaction. Experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks.
- Score: 24.731387422897644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (MLLMs) have recently shown strong capacity in integrating data across multiple modalities, empowered by a generalizable attention architecture. Advanced methods predominantly focus on language-centric tuning while exploring less how multimodal tokens are mixed through attention, posing challenges in high-level tasks that require fine-grained cognition and emotion understanding. In this work, we identify the attention deficit disorder problem in multimodal learning, caused by inconsistent cross-modal attention and layer-by-layer decayed attention activation. To address this, we propose a novel attention mechanism, termed MOdular Duplex Attention (MODA), which simultaneously conducts inner-modal refinement and inter-modal interaction. MODA employs a correct-after-align strategy to effectively decouple modality alignment from cross-layer token mixing. In the alignment phase, tokens are mapped to duplex modality spaces based on basis vectors, enabling interaction between the visual and language modalities. The correctness of attention scores is then ensured through adaptive masked attention, which enhances the model's flexibility by allowing customizable masking patterns for different modalities. Extensive experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks. Source code and demo are available at https://zzcheng.top/MODA.
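The abstract outlines a two-step mechanism: first align tokens from each modality into a shared "duplex" space via basis vectors, then mix them under an adaptively masked attention. Below is a minimal PyTorch sketch of that correct-after-align reading; every module name, shape, and design choice is our assumption for illustration, not the authors' released code (see https://zzcheng.top/MODA for that).

```python
# Hedged sketch of the "correct-after-align" idea from the MODA abstract.
# All names, shapes, and design choices are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModularDuplexAttentionSketch(nn.Module):
    def __init__(self, dim: int, n_basis: int = 64):
        super().__init__()
        # Align phase: learnable basis vectors spanning a shared "duplex"
        # space for the visual and language modalities (assumed form).
        self.vis_basis = nn.Parameter(torch.randn(n_basis, dim) / dim ** 0.5)
        self.txt_basis = nn.Parameter(torch.randn(n_basis, dim) / dim ** 0.5)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.scale = dim ** -0.5

    def align(self, x, basis):
        # Express each token as a soft combination of modality basis vectors.
        coeff = F.softmax(x @ basis.t(), dim=-1)   # (B, T, n_basis)
        return coeff @ basis                       # back to (B, T, dim)

    def forward(self, vis, txt, modal_mask=None):
        # vis: (B, Tv, dim) visual tokens; txt: (B, Tt, dim) language tokens.
        x = torch.cat([self.align(vis, self.vis_basis),
                       self.align(txt, self.txt_basis)], dim=1)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        if modal_mask is not None:
            # "Correct" phase: adaptive masking with a customizable boolean
            # pattern per modality pair (True = link allowed).
            attn = attn.masked_fill(~modal_mask, float("-inf"))
        return F.softmax(attn, dim=-1) @ v
```

A block-structured `modal_mask` that permits inner-modal links everywhere but exposes inter-modal links only where desired would be one example of the customizable masking patterns the abstract mentions.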
Related papers
- Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy [9.590408084883402]
We propose a three-level hierarchy that organizes affective tasks according to their cognitive depth: perception, understanding, and interaction. We introduce Nano-EmoX, a small-scale multitask model, and P2E (Perception-to-Empathy), a curriculum-based training framework. The outputs are projected into a unified language space via heterogeneous adapters, empowering a lightweight language model to tackle diverse affective tasks.
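As a reading of the "heterogeneous adapters" step above, here is a hedged sketch that projects per-modality features into a single language-model embedding space; the module name, modality set, and dimensions are illustrative guesses, not Nano-EmoX itself.

```python
# Illustrative sketch: per-modality adapters mapping into one LM token space.
# Names and sizes are hypothetical.
import torch
import torch.nn as nn

class HeterogeneousAdapters(nn.Module):
    def __init__(self, lm_dim: int = 768, dims: dict = None):
        super().__init__()
        dims = dims or {"vision": 1024, "audio": 512}  # assumed widths
        # One adapter per modality, each with its own input width.
        self.adapters = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(d, lm_dim), nn.GELU(),
                                nn.Linear(lm_dim, lm_dim))
            for name, d in dims.items()
        })

    def forward(self, feats: dict) -> torch.Tensor:
        # feats maps modality name -> (B, T_m, dim_m); outputs share lm_dim,
        # so a lightweight language model can consume them as ordinary tokens.
        return torch.cat([self.adapters[m](x) for m, x in feats.items()], dim=1)
```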
arXiv Detail & Related papers (2026-03-02T17:42:33Z) - Revealing the Attention Floating Mechanism in Masked Diffusion Models [52.74142815156738]
Masked diffusion models (MDMs) leverage bidirectional attention and a denoising process. This paper investigates attention behaviors in MDMs, revealing the phenomenon of Attention Floating.
arXiv Detail & Related papers (2026-01-12T09:10:05Z) - Multi-speaker Attention Alignment for Multimodal Social Interaction [23.550501177885625]
Social interaction in video requires reasoning over a dynamic interplay of verbal and non-verbal cues. While Multimodal Large Language Models (MLLMs) are natural candidates, simply adding visual inputs yields surprisingly inconsistent gains on social tasks. We propose a multimodal multi-speaker attention alignment method that can be integrated into existing MLLMs.
arXiv Detail & Related papers (2025-11-22T07:35:47Z) - Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. We propose enhancing interpretability by leveraging intra-modal interaction.
arXiv Detail & Related papers (2025-09-26T14:39:13Z) - True Multimodal In-Context Learning Needs Attention to the Visual Context [69.63677595066012]
Multimodal Large Language Models (MLLMs) have enabled Multimodal In-Context Learning (MICL), i.e., adapting to new tasks from a few multimodal demonstrations. Current MLLMs tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. We introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context.
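A minimal sketch of what an attention-reallocation fine-tune in this spirit could look like, assuming the trainable part is a small set of per-head boosts applied to attention logits over visual tokens; the class name and parameterization are hypothetical, not DARA's actual formulation.

```python
# Hypothetical sketch: learnable per-head boosts that re-weight attention
# toward visual tokens while the backbone stays frozen.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualReallocation(nn.Module):
    def __init__(self, n_heads: int):
        super().__init__()
        # One learnable boost per attention head; initialized to no effect.
        self.boost = nn.Parameter(torch.zeros(n_heads))

    def forward(self, attn_logits, visual_token_mask):
        # attn_logits: (B, H, Tq, Tk); visual_token_mask: (Tk,) bool marking
        # which key positions are visual tokens.
        bump = (self.boost.view(1, -1, 1, 1)
                * visual_token_mask.float().view(1, 1, 1, -1))
        return F.softmax(attn_logits + bump, dim=-1)
```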
arXiv Detail & Related papers (2025-07-21T17:08:18Z) - MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across the MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z) - Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection [0.0]
We introduce a novel multi-modal Co-AttenDWG architecture that leverages dual-path encoding, co-attention with dimension-wise gating, and advanced expert fusion. We validate our approach on the MIMIC and SemEval Memotion 1.0 datasets, where experimental results demonstrate significant improvements in cross-modal alignment and state-of-the-art performance.
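To make "co-attention with dimension-wise gating" concrete, here is an illustrative sketch under our assumptions; the module names, pooling, and gate placement are guesses, not the Co-AttenDWG code.

```python
# Sketch: bidirectional co-attention, then a per-channel sigmoid gate.
import torch
import torch.nn as nn

class CoAttentionDWG(nn.Module):
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.t2v = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.v2t = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # The gate emits one sigmoid weight per feature dimension.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text, image):
        # text: (B, Tt, dim); image: (B, Tv, dim).
        t_att, _ = self.t2v(text, image, image)   # text attends to image
        v_att, _ = self.v2t(image, text, text)    # image attends to text
        g = self.gate(torch.cat([t_att.mean(1), v_att.mean(1)], dim=-1))
        # Gate each channel of the fused clip-level representation.
        return g * (t_att.mean(1) + v_att.mean(1))
```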
arXiv Detail & Related papers (2025-05-25T07:26:00Z) - MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network [6.304608172789466]
The proposed Multi-modal Attention for Valence-Arousal Emotion Network (MAVEN) integrates visual, audio, and textual modalities. MAVEN uses modality-specific encoders to extract features from synchronized video frames, audio segments, and transcripts. The architecture captures the subtle and transient nature of emotional expressions in conversational videos and improves emotion recognition in real-world situations.
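A rough sketch of how a tri-modal valence-arousal head matching this description might look; the fusion layout and output squashing are placeholders rather than MAVEN's actual design.

```python
# Placeholder sketch: video attends over audio+text context, then a small
# head regresses valence and arousal.
import torch
import torch.nn as nn

class TriModalVAHead(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, 2)  # valence and arousal in [-1, 1]

    def forward(self, video, audio, text):
        # Each input: (B, T, dim) from a modality-specific encoder applied to
        # synchronized frames, audio segments, and transcripts.
        context = torch.cat([audio, text], dim=1)
        fused, _ = self.cross(video, context, context)
        return torch.tanh(self.head(fused.mean(dim=1)))
```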
arXiv Detail & Related papers (2025-03-16T19:32:32Z) - Tracing Intricate Cues in Dialogue: Joint Graph Structure and Sentiment Dynamics for Multimodal Emotion Recognition [37.12407597998884]
A novel approach named GraphSmile is proposed for tracking intricate emotional cues in multimodal dialogues. GraphSmile comprises two key components, i.e., the GSF and SDP modules. Empirical results on multiple benchmarks demonstrate that GraphSmile can handle complex emotional and sentimental patterns.
arXiv Detail & Related papers (2024-07-31T11:47:36Z) - Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields.
Previous work ignored the inter-modal alignment process and the intra-modal noise present before multimodal fusion.
We develop a novel approach called Masked Graph Learning with Recurrent Alignment (MGLRA) to tackle this problem.
arXiv Detail & Related papers (2024-07-23T02:23:51Z) - AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal improvements of 2.34% and 2.87% in the Acc-7 and w-F1 metrics, respectively.
arXiv Detail & Related papers (2024-04-12T11:31:18Z) - MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts [92.76662894585809]
We introduce an approach to enhance multimodal models, which we call Multimodal Mixtures of Experts (MMoE).
MMoE can be applied to various types of models to yield improvements.
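Since the summary names mixtures of experts over multimodal interactions, a standard soft-routing MoE over text-image interaction experts is a reasonable sketch; the three-expert setup and routing below are our assumptions, not the MMoE paper's exact design.

```python
# Assumed sketch: soft mixture over "interaction experts" for a text-image pair.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalMoE(nn.Module):
    def __init__(self, dim: int, n_experts: int = 3):
        super().__init__()
        # Hypothetically, one expert each for redundant, unique, and
        # synergistic text-image interactions.
        self.experts = nn.ModuleList(
            [nn.Linear(2 * dim, dim) for _ in range(n_experts)])
        self.router = nn.Linear(2 * dim, n_experts)

    def forward(self, text_feat, image_feat):
        x = torch.cat([text_feat, image_feat], dim=-1)        # (B, 2*dim)
        weights = F.softmax(self.router(x), dim=-1)           # (B, E)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, dim)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)
```

Soft routing keeps every expert differentiable end to end; a top-k router would be the usual sparser alternative.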
arXiv Detail & Related papers (2023-11-16T05:31:21Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
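The implicit manipulation query resembles learnable query tokens that cross-attend to a modality's features to pool global context; the following sketch shows that generic pattern under assumed names and sizes, not the paper's exact module.

```python
# Generic sketch: learnable queries aggregating global cues from one modality.
import torch
import torch.nn as nn

class ImplicitQueries(nn.Module):
    def __init__(self, dim: int, n_queries: int = 8, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (B, T, dim) from one modality; the learned queries pull in
        # global contextual cues via cross-attention.
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)
        return pooled  # (B, n_queries, dim)
```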
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input [27.102030262319197]
We present Switch-BERT for joint vision and language representation learning to address the problem of modality mismatch.
Switch-BERT extends BERT architecture by introducing learnable layer-wise and cross-layer interactions.
Results confirm that, whereas alternative architectures including ViLBERT and UNITER may excel in particular tasks, Switch-BERT consistently achieves better or comparable performance.
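One hedged reading of "learnable layer-wise and cross-layer interactions" is a per-layer soft switch over which modality's tokens a layer attends to; the sketch below implements that reading only and is not the Switch-BERT architecture.

```python
# Speculative sketch: a layer softly switches its attention context among
# {own modality, other modality, both}.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchedAttentionLayer(nn.Module):
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Learnable logits over the three context choices for this layer.
        self.switch = nn.Parameter(torch.zeros(3))

    def forward(self, x, other):
        # x, other: (B, T, dim) token sequences from the two modalities.
        w = F.softmax(self.switch, dim=0)
        both = torch.cat([x, other], dim=1)
        outs = [self.attn(x, ctx, ctx)[0] for ctx in (x, other, both)]
        return w[0] * outs[0] + w[1] * outs[1] + w[2] * outs[2]
```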
arXiv Detail & Related papers (2023-06-25T09:28:40Z) - Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
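Dynamic modality gating can be sketched as an input-dependent softmax gate over crossmodal branches; the branch set and gate form below are assumptions for illustration, not HCT-DMG's implementation.

```python
# Assumed sketch: weigh each crossmodal branch by a learned, input-dependent gate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicModalityGate(nn.Module):
    def __init__(self, dim: int, n_branches: int = 3):
        super().__init__()
        self.gate = nn.Linear(dim * n_branches, n_branches)

    def forward(self, branches):
        # branches: list of (B, dim) outputs from crossmodal attention pairs,
        # e.g., text->audio, text->vision, audio->vision.
        x = torch.cat(branches, dim=-1)
        w = F.softmax(self.gate(x), dim=-1)              # (B, n_branches)
        stacked = torch.stack(branches, dim=1)           # (B, n_branches, dim)
        return (w.unsqueeze(-1) * stacked).sum(dim=1)    # gated fusion
```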
arXiv Detail & Related papers (2023-05-23T01:24:15Z) - Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning [23.472951216815765]
Cross-modal representation learning and fine-grained feature discrimination are key to effective video representations.
In this paper, we enrich intra-modality and cross-modality relations for representation modeling.
We enlarge the discriminative power of feature embedding with a hard-pairs guided contrastive learning scheme.
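A hard-pairs guided contrastive objective can be illustrated as InfoNCE restricted to the hardest in-batch negatives; the function below is a hypothetical formulation consistent with, but not taken from, the paper.

```python
# Hypothetical hard-pairs contrastive loss: contrast each aligned visual-audio
# pair against its k hardest in-batch negatives (assumes k_hard < batch size).
import torch
import torch.nn.functional as F

def hard_pair_contrastive(vis, aud, temperature=0.07, k_hard=4):
    # vis, aud: (B, dim) clip-level embeddings for the two modalities.
    vis, aud = F.normalize(vis, dim=-1), F.normalize(aud, dim=-1)
    sim = vis @ aud.t() / temperature                   # (B, B) similarities
    pos = sim.diag()                                    # aligned pairs
    neg = sim.masked_fill(torch.eye(len(sim), dtype=torch.bool,
                                    device=sim.device), float("-inf"))
    hard = neg.topk(k_hard, dim=-1).values              # hardest negatives
    logits = torch.cat([pos.unsqueeze(-1), hard], dim=-1)
    labels = torch.zeros(len(sim), dtype=torch.long, device=sim.device)
    return F.cross_entropy(logits, labels)              # positive at index 0
```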
arXiv Detail & Related papers (2022-06-21T07:29:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.