MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis
- URL: http://arxiv.org/abs/2510.11579v1
- Date: Mon, 13 Oct 2025 16:23:32 GMT
- Title: MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis
- Authors: Hongyu Zhu, Lin Chen, Mounim A. El-Yacoubi, Mingsheng Shang
- Abstract summary: Multimodal Sentiment Analysis (MSA) aims to identify and interpret human emotions by integrating information from heterogeneous data sources. Mixup-based augmentation improves generalization in unimodal tasks, but its direct application to MSA introduces critical challenges. We propose MS-Mix, an adaptive, emotion-sensitive augmentation framework that automatically optimizes sample mixing in multimodal settings.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Sentiment Analysis (MSA) aims to identify and interpret human emotions by integrating information from heterogeneous data sources such as text, video, and audio. While deep learning models have advanced in network architecture design, they remain heavily limited by scarce multimodal annotated data. Although Mixup-based augmentation improves generalization in unimodal tasks, its direct application to MSA introduces critical challenges: random mixing often amplifies label ambiguity and semantic inconsistency due to the lack of emotion-aware mixing mechanisms. To overcome these issues, we propose MS-Mix, an adaptive, emotion-sensitive augmentation framework that automatically optimizes sample mixing in multimodal settings. The key components of MS-Mix include: (1) a Sentiment-Aware Sample Selection (SASS) strategy that effectively prevents semantic confusion caused by mixing samples with contradictory emotions; (2) a Sentiment Intensity Guided (SIG) module using multi-head self-attention to compute modality-specific mixing ratios dynamically based on their respective emotional intensities; and (3) a Sentiment Alignment Loss (SAL) that aligns the prediction distributions across modalities and incorporates a Kullback-Leibler-based loss as an additional regularization term to train the emotion intensity predictor and the backbone network jointly. Extensive experiments on three benchmark datasets with six state-of-the-art backbones confirm that MS-Mix consistently outperforms existing methods, establishing a new standard for robust multimodal sentiment augmentation. The source code is available at: https://github.com/HongyuZhu-s/MS-Mix.
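To make the three components concrete, here is a minimal NumPy sketch of the mixing pipeline the abstract describes. Everything below is an illustrative assumption rather than the authors' implementation: the function names are invented, a simple intensity-ratio rule stands in for the paper's multi-head self-attention in the SIG module, and a bare KL term stands in for the full SAL objective.

```python
import numpy as np

def sentiment_aware_pairs(labels, rng):
    """SASS-style pairing (assumed form): partner each sample with another
    whose sentiment label has the same polarity, avoiding contradictory mixes."""
    idx = np.arange(len(labels))
    partner = idx.copy()  # fall back to self-pairing when no same-polarity match exists
    for i in idx:
        same = idx[(np.sign(labels) == np.sign(labels[i])) & (idx != i)]
        if len(same):
            partner[i] = rng.choice(same)
    return partner

def ms_mix(feats, labels, intensities, partner):
    """SIG-style mixing (assumed form): each modality gets its own mixing ratio,
    driven by the relative emotional intensity of the two paired samples."""
    mixed, lam_per_mod = {}, []
    for m, x in feats.items():
        s1, s2 = intensities[m], intensities[m][partner]
        lam = np.clip(s1 / (s1 + s2 + 1e-8), 0.0, 1.0)  # modality-specific ratio
        mixed[m] = lam[:, None] * x + (1.0 - lam[:, None]) * x[partner]
        lam_per_mod.append(lam)
    lam_label = np.mean(np.stack(lam_per_mod), axis=0)  # average ratio for the label
    mixed_labels = lam_label * labels + (1.0 - lam_label) * labels[partner]
    return mixed, mixed_labels

def kl_alignment(p, q, eps=1e-8):
    """SAL-style regularizer (assumed form): KL divergence between the
    prediction distributions of two modalities, encouraging agreement."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))
```

For example, two positive samples would be paired and mixed with a text ratio driven by their relative text intensities, while a lone negative sample falls back to self-pairing and is left unchanged.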
Related papers
- SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data [31.146366498415784]
Multimodal semantic segmentation integrates complementary information from diverse sensors for remote sensing Earth observation. IMSS faces three key challenges: multimodal imbalance, where dominant modalities suppress fragile ones; intra-class variation in scale, shape, and orientation across modalities; and cross-modal heterogeneity with conflicting cues producing inconsistent semantic responses. We propose the Semantic-Guided Modality-Aware (SGMA) framework, which ensures balanced multimodal learning while reducing intra-class variation and reconciling cross-modal inconsistencies through semantic guidance.
arXiv Detail & Related papers (2026-03-03T01:28:21Z) - Integrating Fine-Grained Audio-Visual Evidence for Robust Multimodal Emotion Reasoning [9.470507126417292]
We introduce SABER-LLM, a framework designed for robust multimodal reasoning. First, we construct SABER, a large-scale emotion reasoning dataset comprising 600K video clips. Second, we propose the structured evidence decomposition paradigm, which enforces a "perceive-then-reason" separation between evidence extraction and reasoning.
arXiv Detail & Related papers (2026-01-26T10:03:26Z) - Diffusion-Guided Mask-Consistent Paired Mixing for Endoscopic Image Segmentation [57.37991748282666]
We propose a paired, diffusion-guided paradigm that fuses the strengths of sample mixing and diffusion synthesis. For each real image, a synthetic counterpart is generated under the same mask, and the pair is used as a controllable input for Mask-Consistent Paired Mixing (MCPMix). This produces a continuous family of intermediate samples that smoothly bridges synthetic and real appearances under shared geometry.
arXiv Detail & Related papers (2025-11-05T06:14:19Z) - Beyond Simple Fusion: Adaptive Gated Fusion for Robust Multimodal Sentiment Analysis [27.11612547025828]
We introduce the Adaptive Gated Fusion Network (AGFN), which adaptively adjusts feature weights based on information entropy and modality importance. Experiments on CMU-MOSI and CMU-MOSEI show that AGFN significantly outperforms strong baselines in accuracy, effectively discerning subtle emotions with robust performance.
arXiv Detail & Related papers (2025-10-02T05:05:41Z) - Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline [53.74701603784333]
Two major forms of deception dominate: human-crafted misinformation and AI-generated content. We propose Unified Multimodal Fake Content Detection (UMFDet), a framework designed to handle both forms of deception. UMFDet achieves robust and consistent performance across both misinformation types, outperforming specialized baselines.
arXiv Detail & Related papers (2025-09-30T09:26:32Z) - Joint Learning using Mixture-of-Expert-Based Representation for Enhanced Speech Generation and Robust Emotion Recognition [54.44798086835314]
Speech emotion recognition (SER) plays a critical role in building emotion-aware speech systems, but its performance degrades significantly under noisy conditions. We propose the Sparse Mixture-of-Experts Representation Integration Technique (Sparse MERIT), a flexible MTL framework that applies frame-wise expert routing over self-supervised speech representations. Experiments on the MSP-Podcast corpus show that Sparse MERIT consistently outperforms baseline models on both SER and SE tasks.
arXiv Detail & Related papers (2025-09-10T10:18:56Z) - FLUID: Flow-Latent Unified Integration via Token Distillation for Expert Specialization in Multimodal Learning [1.912429179274357]
We present FLUID (Flow-Latent Unified Integration via Token Distillation) for expert specialization in multimodal learning. FLUID contributes three core elements: (1) Q-transforms, learnable query tokens that distill and retain salient token-level features from modality-specific backbones; (2) a two-stage fusion scheme that enforces cross-modal consistency via contrastive alignment; and (3) a lightweight, load-balanced Mixture-of-Experts at prediction time.
arXiv Detail & Related papers (2025-08-10T09:34:17Z) - MoME: Mixture of Multimodal Experts for Cancer Survival Prediction [46.520971457396726]
Survival analysis, as a challenging task, requires integrating Whole Slide Images (WSIs) and genomic data for comprehensive decision-making.
Previous approaches utilize co-attention methods, which fuse features from both modalities only once after separate encoding.
We propose a Biased Progressive Encoding (BPE) paradigm, performing encoding and fusion simultaneously.
arXiv Detail & Related papers (2024-06-14T03:44:33Z) - PowMix: A Versatile Regularizer for Multimodal Sentiment Analysis [71.8946280170493]
This paper introduces PowMix, a versatile embedding space regularizer that builds upon the strengths of unimodal mixing-based regularization approaches.
PowMix is integrated before the fusion stage of multimodal architectures and facilitates intra-modal mixing, such as mixing text with text, to act as a regularizer.
arXiv Detail & Related papers (2023-12-19T17:01:58Z) - Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z) - Harnessing Hard Mixed Samples with Decoupled Regularizer [69.98746081734441]
Mixup is an efficient data augmentation approach that improves the generalization of neural networks by smoothing the decision boundary with mixed data.
In this paper, we propose an efficient mixup objective function with a decoupled regularizer, named Decoupled Mixup (DM).
DM can adaptively utilize hard mixed samples to mine discriminative features without losing the original smoothness of mixup.
arXiv Detail & Related papers (2022-03-21T07:12:18Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
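Several entries above (MS-Mix itself, PowMix, Decoupled Mixup) build on vanilla mixup. As background, the base operation is just a convex combination of two samples and their labels; the sketch below is a generic illustration, with the `lam` override parameter added for convenience rather than taken from any of the papers:

```python
import numpy as np

def mixup(x1, y1, x2, y2, lam=None, alpha=0.2, rng=None):
    """Vanilla mixup: convex combination of two inputs and their labels.
    lam is drawn from Beta(alpha, alpha) when not supplied explicitly."""
    if lam is None:
        rng = rng or np.random.default_rng()
        lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2
```

In the intra-modal setting PowMix describes, `x1` and `x2` would come from the same modality (e.g., text mixed with text) before the fusion stage.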
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.