Towards Explainable Fusion and Balanced Learning in Multimodal Sentiment Analysis
- URL: http://arxiv.org/abs/2504.12151v1
- Date: Wed, 16 Apr 2025 15:00:06 GMT
- Title: Towards Explainable Fusion and Balanced Learning in Multimodal Sentiment Analysis
- Authors: Miaosen Luo, Yuncheng Jiang, Sijie Mai
- Abstract summary: KAN-MCP is a novel framework that integrates the interpretability of Kolmogorov-Arnold Networks (KAN) with the robustness of the Multimodal Clean Pareto (MCPareto) framework. We introduce the Dimensionality Reduction and Denoising Modal Information Bottleneck (DRD-MIB) method, which jointly denoises and reduces feature dimensionality. This synergy of interpretability and robustness achieves superior performance on benchmark datasets such as CMU-MOSI, CMU-MOSEI, and CH-SIMS v2.
- Score: 14.029574339845476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Sentiment Analysis (MSA) faces two critical challenges: the lack of interpretability in the decision logic of multimodal fusion and modality imbalance caused by disparities in inter-modal information density. To address these issues, we propose KAN-MCP, a novel framework that integrates the interpretability of Kolmogorov-Arnold Networks (KAN) with the robustness of the Multimodal Clean Pareto (MCPareto) framework. First, KAN leverages its univariate function decomposition to achieve transparent analysis of cross-modal interactions. This structural design allows direct inspection of feature transformations without relying on external interpretation tools, thereby ensuring both high expressiveness and interpretability. Second, the proposed MCPareto enhances robustness by addressing modality imbalance and noise interference. Specifically, we introduce the Dimensionality Reduction and Denoising Modal Information Bottleneck (DRD-MIB) method, which jointly denoises and reduces feature dimensionality. This approach provides KAN with discriminative low-dimensional inputs to reduce the modeling complexity of KAN while preserving critical sentiment-related information. Furthermore, MCPareto dynamically balances gradient contributions across modalities using the purified features output by DRD-MIB, ensuring lossless transmission of auxiliary signals and effectively alleviating modality imbalance. This synergy of interpretability and robustness not only achieves superior performance on benchmark datasets such as CMU-MOSI, CMU-MOSEI, and CH-SIMS v2 but also offers an intuitive visualization interface through KAN's interpretable architecture.
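The fusion idea described above can be illustrated with a small sketch. Below, each scalar feature from the (already denoised, low-dimensional) modality representations passes through its own learnable univariate function, and the prediction is the sum of those per-feature terms, so each feature's contribution can be inspected directly. This is a hypothetical stand-in, not the paper's implementation: KAN-MCP uses spline-parameterized KAN layers fed by the DRD-MIB encoder and MCPareto gradient balancing, whereas here each univariate function is approximated by a tiny per-feature MLP, and all class and variable names are invented.

```python
# Minimal sketch of a KAN-style fusion head: one learnable univariate
# function per input feature, summed into the prediction. Illustrative
# only; the actual KAN-MCP model differs in parameterization and training.
import torch
import torch.nn as nn


class UnivariateFunction(nn.Module):
    """phi_i: R -> R^out, a learnable function of one scalar input."""

    def __init__(self, out_features: int, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, out_features)
        )

    def forward(self, x):  # x: (batch, 1)
        return self.net(x)


class KANStyleFusionHead(nn.Module):
    """Sums per-feature univariate transforms, so each input feature's
    contribution to the sentiment prediction can be read off directly."""

    def __init__(self, in_features: int, out_features: int = 1):
        super().__init__()
        self.funcs = nn.ModuleList(
            [UnivariateFunction(out_features) for _ in range(in_features)]
        )

    def forward(self, x):  # x: (batch, in_features), concatenated modality features
        contributions = [f(x[:, i : i + 1]) for i, f in enumerate(self.funcs)]
        per_feature = torch.stack(contributions, dim=1)  # (batch, in, out)
        return per_feature.sum(dim=1), per_feature       # prediction + attributions


# Usage: concatenate low-dimensional text/audio/vision features and inspect
# which dimensions drive the prediction.
text, audio, vision = torch.randn(4, 8), torch.randn(4, 4), torch.randn(4, 4)
head = KANStyleFusionHead(in_features=16)
pred, attributions = head(torch.cat([text, audio, vision], dim=-1))
print(pred.shape, attributions.shape)  # torch.Size([4, 1]) torch.Size([4, 16, 1])
```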
Related papers
- Technical Approach for the EMI Challenge in the 8th Affective Behavior Analysis in-the-Wild Competition [10.741278852581646]
Emotional Mimicry Intensity (EMI) estimation plays a pivotal role in understanding human social behavior and advancing human-computer interaction. This paper proposes a dual-stage cross-modal alignment framework to address the limitations of existing methods. Experiments on the Hume-Vidmimic2 dataset demonstrate superior performance, with an average Pearson correlation coefficient of 0.51 across six emotion dimensions.
arXiv Detail & Related papers (2025-03-13T17:46:16Z) - Modality-Invariant Bidirectional Temporal Representation Distillation Network for Missing Multimodal Sentiment Analysis [6.15602203132432]
We introduce the Modality-Invariant Bidirectional Temporal Representation Distillation Network (MITR-DNet) for Missing Multimodal Sentiment Analysis.
MITR-DNet employs a distillation approach, wherein a complete-modality teacher model guides a missing-modality student model, ensuring robustness in the presence of missing modalities.
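The teacher-student idea can be sketched generically: the teacher encodes the complete input, the student encodes a corrupted (partially missing) version, and a feature-level loss pulls the student toward the teacher. This is a hedged illustration of the general distillation pattern, not MITR-DNet's actual bidirectional temporal distillation; all names and losses below are assumptions.

```python
# Generic complete->missing modality feature distillation sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    def __init__(self, in_dim: int, hid_dim: int = 64):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hid_dim, batch_first=True)

    def forward(self, x):            # x: (batch, time, in_dim)
        out, _ = self.rnn(x)
        return out                   # (batch, time, hid_dim)


def distillation_step(teacher, student, complete, corrupted, labels, head, alpha=0.5):
    """Teacher sees complete modalities; student sees the corrupted input
    and is pulled toward the teacher's intermediate features."""
    with torch.no_grad():
        t_feat = teacher(complete)
    s_feat = student(corrupted)
    distill_loss = F.mse_loss(s_feat, t_feat)                  # feature-level distillation
    task_loss = F.l1_loss(head(s_feat.mean(dim=1)), labels)    # sentiment regression
    return task_loss + alpha * distill_loss


# Usage with random tensors standing in for multimodal sequences.
teacher, student, head = Encoder(32), Encoder(32), nn.Linear(64, 1)
complete = torch.randn(8, 20, 32)
corrupted = complete * (torch.rand(8, 20, 1) > 0.3).float()    # randomly dropped frames
labels = torch.randn(8, 1)
loss = distillation_step(teacher, student, complete, corrupted, labels, head)
loss.backward()
```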
arXiv Detail & Related papers (2025-01-07T07:57:16Z) - Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection [70.84835546732738]
RGB-Thermal Salient Object Detection aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images. Traditional encoder-decoder architectures may not adequately account for robustness against noise originating from defective modalities. We propose ConTriNet, a robust Confluent Triple-Flow Network employing a Divide-and-Conquer strategy.
arXiv Detail & Related papers (2024-12-02T14:44:39Z) - What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation [16.033361754660316]
NOTICE is the first Noise-free Text-Image Corruption and Evaluation pipeline for interpretability in Vision-Language Models (VLMs). Our experiments on the SVO-Probes, MIT-States, and Facial Expression Recognition datasets reveal crucial insights into VLM decision-making. This work paves the way for more transparent and interpretable multimodal systems.
arXiv Detail & Related papers (2024-06-24T05:13:19Z) - DA-Flow: Dual Attention Normalizing Flow for Skeleton-based Video Anomaly Detection [52.74152717667157]
We propose a lightweight module called the Dual Attention Module (DAM) for capturing cross-dimension interaction relationships in spatio-temporal skeletal data.
It employs a frame attention mechanism to identify the most significant frames and a skeleton attention mechanism to capture broader relationships across fixed partitions, with minimal parameters and FLOPs.
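A rough sketch of this dual-attention idea follows: one attention re-weights frames along the time axis, the other re-weights skeleton joints/partitions. The real DA-Flow module differs in its details; everything below is illustrative and the names are invented.

```python
# Hypothetical dual attention over (batch, frames, joints, channels) skeletal features.
import torch
import torch.nn as nn


class DualAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.frame_score = nn.Linear(channels, 1)   # scores each frame
        self.joint_score = nn.Linear(channels, 1)   # scores each joint/partition

    def forward(self, x):                           # x: (batch, frames, joints, channels)
        frame_ctx = x.mean(dim=2)                   # (batch, frames, channels)
        frame_w = torch.softmax(self.frame_score(frame_ctx), dim=1)  # (batch, frames, 1)
        x = x * frame_w.unsqueeze(2)                # emphasize significant frames

        joint_ctx = x.mean(dim=1)                   # (batch, joints, channels)
        joint_w = torch.softmax(self.joint_score(joint_ctx), dim=1)  # (batch, joints, 1)
        return x * joint_w.unsqueeze(1)             # emphasize informative partitions


pose = torch.randn(2, 16, 17, 32)                   # batch, frames, joints, channels
print(DualAttention(32)(pose).shape)                # torch.Size([2, 16, 17, 32])
```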
arXiv Detail & Related papers (2024-06-05T06:18:03Z) - Modality Prompts for Arbitrary Modality Salient Object Detection [57.610000247519196]
This paper delves into the task of arbitrary modality salient object detection (AM SOD).
It aims to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images.
A novel modality-adaptive Transformer (MAT) is proposed to address two fundamental challenges of AM SOD.
arXiv Detail & Related papers (2024-05-06T11:02:02Z) - Modality-Collaborative Transformer with Hybrid Feature Reconstruction
for Robust Emotion Recognition [35.15390769958969]
We propose a unified framework, the Modality-Collaborative Transformer with Hybrid Feature Reconstruction (MCT-HFR).
MCT-HFR consists of a novel attention-based encoder which concurrently extracts and dynamically balances the intra- and inter-modality relations.
During model training, LFI leverages complete features as supervisory signals to recover local missing features, while GFA is designed to reduce the global semantic gap between pairwise complete and incomplete representations.
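The two objectives described above can be sketched generically: a local loss that supervises recovered features only at the missing positions, and a global loss that aligns pooled complete and incomplete representations. The function names and exact loss forms below are assumptions for illustration, not MCT-HFR's definitions.

```python
# Illustrative local-reconstruction and global-alignment losses.
import torch
import torch.nn.functional as F


def local_feature_loss(recovered, complete, missing_mask):
    """MSE only where features were missing. missing_mask: (batch, time), 1 = missing."""
    mask = missing_mask.unsqueeze(-1)                    # (batch, time, 1)
    diff = (recovered - complete) ** 2 * mask
    return diff.sum() / mask.sum().clamp(min=1.0)


def global_alignment_loss(incomplete_repr, complete_repr):
    """Pull the sequence-level summaries of both views together."""
    return 1.0 - F.cosine_similarity(
        incomplete_repr.mean(dim=1), complete_repr.mean(dim=1), dim=-1
    ).mean()


complete = torch.randn(4, 10, 32)
recovered = torch.randn(4, 10, 32, requires_grad=True)
missing = (torch.rand(4, 10) > 0.7).float()
loss = local_feature_loss(recovered, complete, missing) + \
       global_alignment_loss(recovered, complete)
loss.backward()
```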
arXiv Detail & Related papers (2023-12-26T01:59:23Z) - Synergistic Signal Denoising for Multimodal Time Series of Structure
Vibration [9.144905626316534]
This paper introduces a novel deep learning algorithm tailored for the complexities inherent in multimodal vibration signals prevalent in Structural Health Monitoring (SHM).
By amalgamating convolutional and recurrent architectures, the algorithm adeptly captures both localized and prolonged structural behaviors.
Our results showcase significant improvements in predictive accuracy, early damage detection, and adaptability across multiple SHM scenarios.
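The "convolutional plus recurrent" combination mentioned above can be illustrated with a minimal denoiser: a 1-D convolution captures localized patterns and a bidirectional LSTM captures prolonged behavior. The architecture below is a generic stand-in, not the paper's model, and all names are invented.

```python
# Generic convolutional-recurrent denoiser sketch for multichannel vibration signals.
import torch
import torch.nn as nn


class ConvRecurrentDenoiser(nn.Module):
    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        self.local = nn.Conv1d(channels, hidden, kernel_size=5, padding=2)           # localized patterns
        self.temporal = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)  # long-range behavior
        self.out = nn.Linear(2 * hidden, channels)

    def forward(self, x):                       # x: (batch, time, channels), noisy signals
        h = torch.relu(self.local(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.temporal(h)
        return self.out(h)                      # denoised signal estimate


noisy = torch.randn(2, 256, 3)                  # e.g. 3 sensor channels
print(ConvRecurrentDenoiser(3)(noisy).shape)    # torch.Size([2, 256, 3])
```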
arXiv Detail & Related papers (2023-08-17T00:41:50Z) - Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical
Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
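The dynamic-gating intuition can be sketched as follows: each crossmodal branch produces a candidate representation, and a learned gate decides how much each branch contributes per sample, e.g. down-weighting an incongruent modality. This is not HCT-DMG's actual gating rule, just the general idea, with invented names.

```python
# Hypothetical per-sample gating over crossmodal branch outputs.
import torch
import torch.nn as nn


class DynamicModalityGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, branch_outputs):                        # list of (batch, dim) tensors
        stacked = torch.stack(branch_outputs, dim=1)          # (batch, branches, dim)
        gates = torch.softmax(self.scorer(stacked), dim=1)    # (batch, branches, 1)
        return (gates * stacked).sum(dim=1), gates.squeeze(-1)


ta, tv, av = (torch.randn(4, 32) for _ in range(3))           # crossmodal branch outputs
fused, gates = DynamicModalityGate(32)([ta, tv, av])
print(fused.shape, gates.shape)                               # torch.Size([4, 32]) torch.Size([4, 3])
```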
arXiv Detail & Related papers (2023-05-23T01:24:15Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal
Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
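A hedged sketch of the pairwise-fusion idea: two bimodal pairs (here assumed to be text-audio and text-vision) are each fused with cross-attention, and the pooled results are combined for prediction. BBFN's actual blocks are more involved (gated control, layer-wise fusion), and all names below are illustrative.

```python
# Fusing two bimodal pairs with cross-attention, then combining them.
import torch
import torch.nn as nn


class BimodalFusion(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_mod, other_mod):       # (batch, time, dim) each
        fused, _ = self.attn(query_mod, other_mod, other_mod)
        return fused.mean(dim=1)                   # pooled pair representation


class TwoPairFusionHead(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.text_audio = BimodalFusion(dim)
        self.text_vision = BimodalFusion(dim)
        self.predict = nn.Linear(2 * dim, 1)

    def forward(self, text, audio, vision):
        ta = self.text_audio(text, audio)
        tv = self.text_vision(text, vision)
        return self.predict(torch.cat([ta, tv], dim=-1))


text, audio, vision = (torch.randn(4, 20, 64) for _ in range(3))
print(TwoPairFusionHead(64)(text, audio, vision).shape)   # torch.Size([4, 1])
```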
arXiv Detail & Related papers (2021-07-28T23:33:42Z) - Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person
Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)