A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations
- URL: http://arxiv.org/abs/2602.23300v1
- Date: Thu, 26 Feb 2026 18:08:40 GMT
- Title: A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations
- Authors: Soumya Dutta, Smruthi Balaji, Sriram Ganapathy
- Abstract summary: We propose a modular Mixture-of-Experts for Recognition of Emotions (MiSTER-E) framework to decouple two core challenges in Emotion Recognition in Conversations (ERC). MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism.
- Score: 24.302280709646563
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regularization across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems. We also provide various ablations to highlight the contributions made in the proposed approach.
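The abstract does not give implementation details, so the PyTorch snippet below is only a minimal sketch of the ideas it names: three experts (speech-only, text-only, cross-modal) mixed by a learned gate, a KL-divergence regularizer across expert predictions, and a supervised contrastive loss over paired speech-text embeddings. All dimensions, layer choices, class counts, and loss weights are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedSpeechTextMoE(nn.Module):
    """Three experts (speech-only, text-only, cross-modal) combined by a learned gate."""

    def __init__(self, dim: int = 768, num_classes: int = 6):
        super().__init__()
        self.speech_expert = nn.Linear(dim, num_classes)
        self.text_expert = nn.Linear(dim, num_classes)
        self.cross_expert = nn.Linear(2 * dim, num_classes)
        self.gate = nn.Linear(2 * dim, 3)  # one mixing weight per expert

    def forward(self, speech_emb: torch.Tensor, text_emb: torch.Tensor):
        fused = torch.cat([speech_emb, text_emb], dim=-1)
        expert_logits = torch.stack(
            [
                self.speech_expert(speech_emb),
                self.text_expert(text_emb),
                self.cross_expert(fused),
            ],
            dim=1,
        )  # (batch, 3, num_classes)
        weights = F.softmax(self.gate(fused), dim=-1)          # (batch, 3)
        mixed = (weights.unsqueeze(-1) * expert_logits).sum(1)  # gated mixture
        return mixed, expert_logits


def kl_agreement(expert_logits: torch.Tensor) -> torch.Tensor:
    """KL-based regularizer pulling each expert toward the mean expert prediction."""
    probs = F.softmax(expert_logits, dim=-1)       # (batch, 3, classes)
    mean = probs.mean(dim=1, keepdim=True)         # consensus distribution
    return F.kl_div(probs.log(), mean.expand_as(probs), reduction="batchmean")


def supervised_contrastive(speech_emb, text_emb, labels, temperature: float = 0.07):
    """Supervised contrastive loss over paired speech and text embeddings."""
    z = F.normalize(torch.cat([speech_emb, text_emb], dim=0), dim=-1)
    y = torch.cat([labels, labels], dim=0)
    sim = z @ z.t() / temperature
    pos_mask = (y.unsqueeze(0) == y.unsqueeze(1)).float()
    pos_mask.fill_diagonal_(0)                      # exclude self-pairs as positives
    diag = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    log_prob = sim - torch.logsumexp(sim.masked_fill(diag, -1e9), dim=1, keepdim=True)
    loss = -(pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()


# Illustrative usage on a batch of 4 utterance embeddings (loss weights assumed).
model = GatedSpeechTextMoE()
s, t = torch.randn(4, 768), torch.randn(4, 768)
labels = torch.tensor([0, 2, 1, 0])
mixed_logits, expert_logits = model(s, t)
loss = (F.cross_entropy(mixed_logits, labels)
        + 0.1 * kl_agreement(expert_logits)
        + 0.1 * supervised_contrastive(s, t, labels))
```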
Related papers
- Covo-Audio Technical Report [61.09708870154148]
Covo-Audio, a 7B end-to-end LALM, directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Covo-Audio-Chat, a dialogue-oriented variant, demonstrates strong semantic and spoken conversational abilities.
arXiv Detail & Related papers (2026-02-10T14:31:11Z) - Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback [82.70507055599093]
We present the first systematic study of preference learning for improving SDS quality in both multi-turn Chain-of-Thought and blockwise duplex models. Experiments show that single-reward RLAIF selectively improves its targeted metric, while joint multi-reward training yields consistent gains across semantic quality and audio naturalness.
arXiv Detail & Related papers (2026-01-27T00:55:14Z) - A Unified Framework for Emotion Recognition and Sentiment Analysis via Expert-Guided Multimodal Fusion with Large Language Models [16.195689085967004]
We present EGMF, a unified framework combining expert-guided multimodal fusion with large language models. Our approach features three specialized expert networks: a fine-grained local expert for subtle emotional nuances, a semantic correlation expert for cross-modal relationships, and a global context expert for long-range dependencies.
arXiv Detail & Related papers (2026-01-12T14:21:32Z) - Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts [0.0]
We propose a simple yet effective Supervised Mixture of Experts (S-MoE). S-MoE eliminates the need for training gating functions by utilizing special guiding tokens to route each task to its designated expert. We apply S-MoE to a speech-to-text model, enabling the model to process mixed-bandwidth input while jointly performing automatic speech recognition (ASR) and speech translation (ST).
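The summary above only names the routing idea, so the following is a hedged sketch of how guiding tokens could deterministically select a designated expert without a trained gate. The task tokens `<asr>`/`<st>`, dimensions, and expert structure are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

TASKS = {"<asr>": 0, "<st>": 1}  # hypothetical guiding tokens mapped to expert ids


class SupervisedMoE(nn.Module):
    """Each task id routes its whole utterance to one designated expert."""

    def __init__(self, dim: int = 512, vocab: int = 8000, num_experts: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, vocab))
            for _ in range(num_experts)
        )

    def forward(self, hidden: torch.Tensor, task_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, time, dim); task_ids: (batch,) integer expert index
        vocab = self.experts[0][-1].out_features
        out = torch.zeros(hidden.size(0), hidden.size(1), vocab, device=hidden.device)
        for e, expert in enumerate(self.experts):
            sel = task_ids == e
            if sel.any():
                out[sel] = expert(hidden[sel])  # hard, supervised routing
        return out


# Example: a batch with one ASR utterance and one ST utterance.
hidden = torch.randn(2, 50, 512)
task_ids = torch.tensor([TASKS["<asr>"], TASKS["<st>"]])
logits = SupervisedMoE()(hidden, task_ids)
```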
arXiv Detail & Related papers (2025-08-05T23:56:11Z) - CMT-LLM: Contextual Multi-Talker ASR Utilizing Large Language Models [23.278483193586887]
We propose a unified framework that combines multi-talker overlapping speech recognition and contextual biasing into a single task. Our approach outperforms traditional contextual biasing methods, achieving a WER of 7.9% on LibriMix and 32.9% on AMI SDM.
arXiv Detail & Related papers (2025-05-31T07:26:44Z) - Qieemo: Speech Is All You Need in the Emotion Recognition in Conversations [1.0690007351232649]
Multimodal approaches benefit from the fusion of diverse modalities, thereby improving recognition accuracy. The proposed Qieemo framework effectively utilizes a pretrained automatic speech recognition (ASR) model, which contains naturally frame-aligned textual and emotional features. Experimental results on the IEMOCAP dataset demonstrate that Qieemo outperforms benchmark unimodal, multimodal, and self-supervised models with absolute improvements of 3.0%, 1.2%, and 1.9% respectively.
arXiv Detail & Related papers (2025-03-05T07:02:30Z) - AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations [57.99479708224221]
We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features.
Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics.
arXiv Detail & Related papers (2024-04-12T11:31:18Z) - DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z) - Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z) - InstructERC: Reforming Emotion Recognition in Conversation with Multi-task Retrieval-Augmented Large Language Models [9.611864685207056]
We propose a novel approach, InstructERC, which reformulates the emotion recognition task from a discriminative framework to a generative framework based on Large Language Models (LLMs).
InstructERC makes three significant contributions: (1) it introduces a simple yet effective retrieval template module, which helps the model explicitly integrate multi-granularity dialogue supervision information; (2) it introduces two additional emotion alignment tasks, speaker identification and emotion prediction, to implicitly model dialogue role relationships and future emotional tendencies in conversations; and (3) it unifies, for the first time, emotion labels across benchmarks through the feeling wheel to fit real application scenarios.
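Since the entry describes recasting ERC as instruction-driven generation with a retrieval template, here is a hypothetical sketch of what such a prompt builder might look like. The field names, wording, and label set are assumptions for illustration, not the actual InstructERC template.

```python
def build_erc_prompt(history: list[str], target_utterance: str,
                     retrieved_examples: list[str], label_set: list[str]) -> str:
    """Assemble an instruction-style ERC prompt (illustrative wording only)."""
    demos = "\n".join(f"- {ex}" for ex in retrieved_examples)   # retrieved demonstrations
    context = "\n".join(history)                                # multi-turn dialogue context
    return (
        "Task: identify the emotion of the last utterance in the conversation.\n"
        f"Allowed labels: {', '.join(label_set)}\n"
        f"Similar labeled examples:\n{demos}\n"
        f"Conversation:\n{context}\n"
        f"Last utterance: {target_utterance}\n"
        "Emotion:"
    )


# Illustrative call with made-up dialogue turns and a generic label set.
prompt = build_erc_prompt(
    history=["A: I can't believe we won!", "B: I know, it's amazing."],
    target_utterance="B: I know, it's amazing.",
    retrieved_examples=["Utterance: 'This is the worst day ever.' -> Emotion: sadness"],
    label_set=["neutral", "joy", "sadness", "anger", "surprise", "fear", "disgust"],
)
```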
arXiv Detail & Related papers (2023-09-21T09:22:07Z) - Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, pre-trained language models (PrLMs) used as encoders represent dialogues only coarsely.
We propose a novel model to fill this gap by modeling the utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)