A Unified Framework for Emotion Recognition and Sentiment Analysis via Expert-Guided Multimodal Fusion with Large Language Models
- URL: http://arxiv.org/abs/2601.07565v1
- Date: Mon, 12 Jan 2026 14:21:32 GMT
- Title: A Unified Framework for Emotion Recognition and Sentiment Analysis via Expert-Guided Multimodal Fusion with Large Language Models
- Authors: Jiaqi Qiao, Xiujuan Xu, Xinran Li, Yu Liu
- Abstract summary: We present EGMF, a unified framework combining expert-guided multimodal fusion with large language models. Our approach features three specialized expert networks--a fine-grained local expert for subtle emotional nuances, a semantic correlation expert for cross-modal relationships, and a global context expert for long-range dependencies.
- Score: 16.195689085967004
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal emotion understanding requires effective integration of text, audio, and visual modalities for both discrete emotion recognition and continuous sentiment analysis. We present EGMF, a unified framework combining expert-guided multimodal fusion with large language models. Our approach features three specialized expert networks--a fine-grained local expert for subtle emotional nuances, a semantic correlation expert for cross-modal relationships, and a global context expert for long-range dependencies--adaptively integrated through hierarchical dynamic gating for context-aware feature selection. Enhanced multimodal representations are integrated with LLMs via pseudo token injection and prompt-based conditioning, enabling a single generative framework to handle both classification and regression through natural language generation. We employ LoRA fine-tuning for computational efficiency. Experiments on bilingual benchmarks (MELD, CHERMA, MOSEI, SIMS-V2) demonstrate consistent improvements over state-of-the-art methods, with superior cross-lingual robustness revealing universal patterns in multimodal emotional expressions across English and Chinese. We will release the source code publicly.
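The authors state that their code has not yet been released, so the following is a minimal, hypothetical PyTorch sketch of the pipeline the abstract describes: three expert networks whose outputs are combined by a dynamic gate (simplified here to a single gating level rather than the paper's hierarchical gating), with the fused representation projected into pseudo tokens for injection into an LLM. All module choices, dimensions, the number of pseudo tokens, and the class name ExpertGuidedFusion are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; not the EGMF reference implementation.
import torch
import torch.nn as nn


class ExpertGuidedFusion(nn.Module):
    def __init__(self, dim=256, n_heads=4, n_pseudo_tokens=8, llm_dim=4096):
        super().__init__()
        # Fine-grained local expert: 1-D convolution over time to capture
        # short-range, subtle emotional cues (assumed design choice).
        self.local_expert = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        # Semantic correlation expert: cross-attention from text to audio+vision.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Global context expert: self-attention over the full multimodal sequence.
        self.global_expert = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True)
        # Dynamic gate: per-sample weights over the three experts
        # (a one-level stand-in for the paper's hierarchical gating).
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 3))
        # Projection of the fused vector into K pseudo tokens living in the
        # LLM's embedding space (pseudo token injection).
        self.to_pseudo = nn.Linear(dim, n_pseudo_tokens * llm_dim)
        self.n_pseudo_tokens, self.llm_dim = n_pseudo_tokens, llm_dim

    def forward(self, text, audio, vision):
        # text/audio/vision: (batch, seq_len, dim) features from upstream encoders.
        x = torch.cat([text, audio, vision], dim=1)                      # (B, L, D)
        local = self.local_expert(x.transpose(1, 2)).transpose(1, 2).mean(dim=1)
        av = torch.cat([audio, vision], dim=1)
        semantic, _ = self.cross_attn(text, av, av)
        semantic = semantic.mean(dim=1)
        global_ctx = self.global_expert(x).mean(dim=1)
        experts = torch.stack([local, semantic, global_ctx], dim=1)      # (B, 3, D)
        weights = torch.softmax(self.gate(x.mean(dim=1)), dim=-1)        # (B, 3)
        fused = (weights.unsqueeze(-1) * experts).sum(dim=1)             # (B, D)
        # Pseudo tokens to prepend to the embedded task prompt of a
        # (LoRA-adapted) LLM, which then generates the emotion label or
        # sentiment score as text.
        pseudo = self.to_pseudo(fused).view(-1, self.n_pseudo_tokens, self.llm_dim)
        return pseudo, weights


# Toy usage with random features standing in for real encoder outputs.
model = ExpertGuidedFusion()
t, a, v = (torch.randn(2, 20, 256) for _ in range(3))
pseudo_tokens, gate_weights = model(t, a, v)
print(pseudo_tokens.shape)  # torch.Size([2, 8, 4096])
```

In a full system the pseudo tokens would presumably be concatenated with the embedded prompt of a LoRA-fine-tuned LLM so that both the discrete emotion class and the continuous sentiment score are produced by natural language generation, as the abstract describes; how the gating is made hierarchical and how prompts are worded are details the abstract does not specify.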
Related papers
- Beyond Language Modeling: An Exploration of Multimodal Pretraining [125.34714978184638]
We provide empirical clarity through controlled, from-scratch pretraining experiments. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language.
arXiv Detail & Related papers (2026-03-03T18:58:00Z) - A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations [24.302280709646563]
We propose a modular Mixture-of-Experts for Recognition of Emotions (MiSTER-E) framework to decouple two core challenges in Emotion Recognition in Conversations (ERC). MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism.
arXiv Detail & Related papers (2026-02-26T18:08:40Z) - Understanding Multilingualism in Mixture-of-Experts LLMs: Routing Mechanism, Expert Specialization, and Layerwise Steering [61.0787902713059]
We propose a routing-guided steering method that adaptively guides routing behavior in middle layers toward shared experts associated with dominant languages at inference time. Our code is available at http://conctsai.com/multilingualism-in-Mixture-of-Experts-LLMs.
arXiv Detail & Related papers (2026-01-20T15:04:25Z) - ECMF: Enhanced Cross-Modal Fusion for Multimodal Emotion Recognition in MER-SEMI Challenge [5.217410271468519]
We tackle the MER-SEMI challenge of the MER2025 competition by proposing a novel multimodal emotion recognition framework. We leverage large-scale pre-trained models to extract informative features from visual, audio, and textual modalities. Our approach achieves a substantial performance improvement over the official baseline on the MER2025-SEMI dataset.
arXiv Detail & Related papers (2025-08-08T03:55:25Z) - MUCAR: Benchmarking Multilingual Cross-Modal Ambiguity Resolution for Multimodal Large Language Models [19.241274582769037]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances across numerous vision-language tasks. We introduce MUCAR, a novel benchmark designed explicitly for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios.
arXiv Detail & Related papers (2025-06-20T14:57:41Z) - NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-Identification [17.10113184019939]
Multi-modal object Re-Identification (ReID) aims to obtain accurate identity features across heterogeneous modalities. In this paper, we propose a reliable caption generation pipeline based on attribute confidence. We also propose a novel ReID framework, named NEXT, to model diverse identity patterns.
arXiv Detail & Related papers (2025-05-26T13:52:28Z) - A-MESS: Anchor based Multimodal Embedding with Semantic Synchronization for Multimodal Intent Recognition [3.4568313440884837]
We present the Anchor-based Multimodal Embedding with Semantic Synchronization (A-MESS) framework. We first design an Anchor-based Multimodal Embedding (A-ME) module that employs an anchor-based embedding fusion mechanism to integrate multimodal inputs. We then develop a Semantic Synchronization (SS) strategy with a Triplet Contrastive Learning pipeline, which optimizes the process by synchronizing multimodal representations with label descriptions.
arXiv Detail & Related papers (2025-03-25T09:09:30Z) - Hierarchical Banzhaf Interaction for General Video-Language Representation Learning [60.44337740854767]
Multimodal representation learning plays an important role in the artificial intelligence domain. We introduce a new approach that models video-text as game players using multivariate cooperative game theory. We extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks.
arXiv Detail & Related papers (2024-12-30T14:09:15Z) - PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis [74.41260927676747]
This paper bridges the gaps by introducing multimodal conversational Aspect-based Sentiment Analysis (ABSA).
To benchmark the tasks, we construct PanoSent, a dataset annotated both manually and automatically, featuring high quality, large scale, multimodality, multilingualism, multi-scenarios, and covering both implicit and explicit sentiment elements.
To effectively address the tasks, we devise a novel Chain-of-Sentiment reasoning framework, together with a novel multimodal large language model (namely Sentica) and a paraphrase-based verification mechanism.
arXiv Detail & Related papers (2024-08-18T13:51:01Z) - Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z) - TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.