Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection
- URL: http://arxiv.org/abs/2505.19010v2
- Date: Wed, 30 Jul 2025 10:40:17 GMT
- Title: Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection
- Authors: Md. Mithun Hossain, Md. Shakil Hossain, Sudipto Chaki, M. F. Mridha
- Abstract summary: Multi-modal learning has emerged as a crucial research direction. Existing approaches often suffer from insufficient cross-modal interactions and rigid fusion strategies. We propose Co-AttenDWG, co-attention with dimension-wise gating, and expert fusion. We show that Co-AttenDWG achieves state-of-the-art performance and superior cross-modal alignment.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multi-modal learning has emerged as a crucial research direction, as integrating textual and visual information can substantially enhance performance in tasks such as classification, retrieval, and scene understanding. Despite advances with large pre-trained models, existing approaches often suffer from insufficient cross-modal interactions and rigid fusion strategies, failing to fully harness the complementary strengths of different modalities. To address these limitations, we propose Co-AttenDWG, co-attention with dimension-wise gating, and expert fusion. Our approach first projects textual and visual features into a shared embedding space, where a dedicated co-attention mechanism enables simultaneous, fine-grained interactions between modalities. This is further strengthened by a dimension-wise gating network, which adaptively modulates feature contributions at the channel level to emphasize salient information. In parallel, dual-path encoders independently refine modality-specific representations, while an additional cross-attention layer aligns the modalities further. The resulting features are aggregated via an expert fusion module that integrates learned gating and self-attention, yielding a robust unified representation. Experimental results on the MIMIC and SemEval Memotion 1.0 datasets show that Co-AttenDWG achieves state-of-the-art performance and superior cross-modal alignment, highlighting its effectiveness for diverse multi-modal applications.
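The co-attention and dimension-wise gating steps described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes the text and image features are already projected into a shared space of equal width, uses plain scaled dot-product attention in both directions, and uses a simplified gate with one weight per dimension. The function names (`attend`, `co_attention`, `dimension_wise_gate`) and the parameters `w_own`, `w_att`, and `bias` are hypothetical, and the dual-path encoders and expert fusion module are omitted.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def attend(queries, keys_values):
    """Scaled dot-product attention: each query row becomes a weighted
    average of the rows in keys_values (used as both keys and values)."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(len(q))])
    return out


def co_attention(text, image):
    """Both directions: text attends to image, and image attends to text."""
    return attend(text, image), attend(image, text)


def dimension_wise_gate(own, attended, w_own, w_att, bias):
    """Assumed gate form: one gate value in (0, 1) per feature dimension,
    mixing the attended feature with the original one for that dimension."""
    out = []
    for o_row, a_row in zip(own, attended):
        gates = [sigmoid(w_own[d] * o_row[d] + w_att[d] * a_row[d] + bias[d])
                 for d in range(len(o_row))]
        out.append([g * a + (1.0 - g) * o
                    for g, a, o in zip(gates, a_row, o_row)])
    return out


# Toy inputs: 3 text tokens and 2 image patches, shared width 4 (assumed).
rng = random.Random(0)
dim = 4
text = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
image = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(2)]
w_own = [rng.uniform(-1, 1) for _ in range(dim)]
w_att = [rng.uniform(-1, 1) for _ in range(dim)]
bias = [0.0] * dim

text_att, image_att = co_attention(text, image)
text_gated = dimension_wise_gate(text, text_att, w_own, w_att, bias)
image_gated = dimension_wise_gate(image, image_att, w_own, w_att, bias)
```

Because each gate value lies strictly between 0 and 1, every gated feature is a per-dimension blend of the original and the cross-attended value. A learned gate would typically use full weight matrices rather than the per-dimension weights assumed here.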
Related papers
- M$^3$amba: CLIP-driven Mamba Model for Multi-modal Remote Sensing Classification [23.322598623627222]
M$^3$amba is a novel end-to-end CLIP-driven Mamba model for multi-modal fusion. We introduce CLIP-driven modality-specific adapters to achieve a comprehensive semantic understanding of different modalities. Experiments have shown that M$^3$amba has an average performance improvement of at least 5.98% compared with the state-of-the-art methods.
arXiv Detail & Related papers (2025-03-09T05:06:47Z) - DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout. DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z) - Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations [19.731611716111566]
We propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations.
We introduce a predictive self-attention module to capture reliable context dynamics within modalities.
A hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities.
A double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner.
arXiv Detail & Related papers (2024-07-06T04:36:48Z) - Recursive Joint Cross-Modal Attention for Multimodal Fusion in Dimensional Emotion Recognition [3.5803801804085347]
We introduce Recursive Joint Cross-Modal Attention (RJCMA) to capture both intra- and inter-modal relationships across audio, visual, and text modalities for dimensional emotion recognition.
In particular, we compute the attention weights based on cross-correlation between the joint audio-visual-text feature representations and the feature representations of individual modalities.
Extensive experiments are conducted to evaluate the performance of the proposed fusion model on the challenging Affwild2 dataset.
arXiv Detail & Related papers (2024-03-20T15:08:43Z) - Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network $\textbf{(MIMIC)}$ framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z) - Alternative Telescopic Displacement: An Efficient Multimodal Alignment Method [3.0903319879656084]
This paper introduces an innovative approach to feature alignment that revolutionizes the fusion of multimodal information.
Our method employs a novel iterative process of telescopic displacement and expansion of feature representations across different modalities, culminating in a coherent unified representation within a shared feature space.
arXiv Detail & Related papers (2023-06-29T13:49:06Z) - Knowledge-Enhanced Hierarchical Information Correlation Learning for Multi-Modal Rumor Detection [82.94413676131545]
We propose a novel knowledge-enhanced hierarchical information correlation learning approach (KhiCL) for multi-modal rumor detection.
KhiCL exploits a cross-modal joint dictionary to transfer heterogeneous unimodal features into a common feature space.
It extracts visual and textual entities from images and text, and designs a knowledge relevance reasoning strategy.
arXiv Detail & Related papers (2023-06-28T06:08:20Z) - Bi-level Dynamic Learning for Jointly Multi-modality Image Fusion and Beyond [50.556961575275345]
We build an image fusion module to fuse complementary characteristics and cascade dual task-related modules.
We develop an efficient first-order approximation to compute corresponding gradients and present dynamic weighted aggregation to balance the gradients for fusion learning.
arXiv Detail & Related papers (2023-05-11T10:55:34Z) - Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention [15.643176705932396]
We introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities.
It computes the cross-attention weights based on correlation between the joint feature representation and that of the individual modalities.
Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches.
arXiv Detail & Related papers (2022-09-19T15:01:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.