A Unified Framework for Emotion Recognition and Sentiment Analysis via Expert-Guided Multimodal Fusion with Large Language Models
- URL: http://arxiv.org/abs/2601.07565v1
- Date: Mon, 12 Jan 2026 14:21:32 GMT
- Title: A Unified Framework for Emotion Recognition and Sentiment Analysis via Expert-Guided Multimodal Fusion with Large Language Models
- Authors: Jiaqi Qiao, Xiujuan Xu, Xinran Li, Yu Liu
- Abstract summary: We present EGMF, a unified framework combining expert-guided multimodal fusion with large language models. Our approach features three specialized expert networks--a fine-grained local expert for subtle emotional nuances, a semantic correlation expert for cross-modal relationships, and a global context expert for long-range dependencies.
- Score: 16.195689085967004
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal emotion understanding requires effective integration of text, audio, and visual modalities for both discrete emotion recognition and continuous sentiment analysis. We present EGMF, a unified framework combining expert-guided multimodal fusion with large language models. Our approach features three specialized expert networks--a fine-grained local expert for subtle emotional nuances, a semantic correlation expert for cross-modal relationships, and a global context expert for long-range dependencies--adaptively integrated through hierarchical dynamic gating for context-aware feature selection. Enhanced multimodal representations are integrated with LLMs via pseudo token injection and prompt-based conditioning, enabling a single generative framework to handle both classification and regression through natural language generation. We employ LoRA fine-tuning for computational efficiency. Experiments on bilingual benchmarks (MELD, CHERMA, MOSEI, SIMS-V2) demonstrate consistent improvements over state-of-the-art methods, with superior cross-lingual robustness revealing universal patterns in multimodal emotional expressions across English and Chinese. We will release the source code publicly.
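The authors state that their code has not yet been released, so the following is a minimal, hypothetical PyTorch sketch of the pipeline the abstract describes: three expert networks whose outputs are combined by a dynamic gate (simplified here to a single gating level rather than the paper's hierarchical gating), with the fused representation projected into pseudo tokens for injection into an LLM. All module choices, dimensions, the number of pseudo tokens, and the class name ExpertGuidedFusion are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; not the EGMF reference implementation.
import torch
import torch.nn as nn


class ExpertGuidedFusion(nn.Module):
    def __init__(self, dim=256, n_heads=4, n_pseudo_tokens=8, llm_dim=4096):
        super().__init__()
        # Fine-grained local expert: 1-D convolution over time to capture
        # short-range, subtle emotional cues (assumed design choice).
        self.local_expert = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        # Semantic correlation expert: cross-attention from text to audio+vision.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Global context expert: self-attention over the full multimodal sequence.
        self.global_expert = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True)
        # Dynamic gate: per-sample weights over the three experts
        # (a one-level stand-in for the paper's hierarchical gating).
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 3))
        # Projection of the fused vector into K pseudo tokens living in the
        # LLM's embedding space (pseudo token injection).
        self.to_pseudo = nn.Linear(dim, n_pseudo_tokens * llm_dim)
        self.n_pseudo_tokens, self.llm_dim = n_pseudo_tokens, llm_dim

    def forward(self, text, audio, vision):
        # text/audio/vision: (batch, seq_len, dim) features from upstream encoders.
        x = torch.cat([text, audio, vision], dim=1)                      # (B, L, D)
        local = self.local_expert(x.transpose(1, 2)).transpose(1, 2).mean(dim=1)
        av = torch.cat([audio, vision], dim=1)
        semantic, _ = self.cross_attn(text, av, av)
        semantic = semantic.mean(dim=1)
        global_ctx = self.global_expert(x).mean(dim=1)
        experts = torch.stack([local, semantic, global_ctx], dim=1)      # (B, 3, D)
        weights = torch.softmax(self.gate(x.mean(dim=1)), dim=-1)        # (B, 3)
        fused = (weights.unsqueeze(-1) * experts).sum(dim=1)             # (B, D)
        # Pseudo tokens to prepend to the embedded task prompt of a
        # (LoRA-adapted) LLM, which then generates the emotion label or
        # sentiment score as text.
        pseudo = self.to_pseudo(fused).view(-1, self.n_pseudo_tokens, self.llm_dim)
        return pseudo, weights


# Toy usage with random features standing in for real encoder outputs.
model = ExpertGuidedFusion()
t, a, v = (torch.randn(2, 20, 256) for _ in range(3))
pseudo_tokens, gate_weights = model(t, a, v)
print(pseudo_tokens.shape)  # torch.Size([2, 8, 4096])
```

In a full system the pseudo tokens would presumably be concatenated with the embedded prompt of a LoRA-fine-tuned LLM so that both the discrete emotion class and the continuous sentiment score are produced by natural language generation, as the abstract describes; how the gating is made hierarchical and how prompts are worded are details the abstract does not specify.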
Related papers
- Beyond Language Modeling: An Exploration of Multimodal Pretraining [125.34714978184638]
We provide empirical clarity through controlled, from-scratch pretraining experiments. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language.
arXiv Detail & Related papers (2026-03-03T18:58:00Z) - A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations [24.302280709646563]
We propose a modular Mixture-of-Experts for Recognition of Emotions (MiSTER-E) framework to decouple two core challenges in Emotion Recognition in Conversations (ERC). MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism.
arXiv Detail & Related papers (2026-02-26T18:08:40Z) - Understanding Multilingualism in Mixture-of-Experts LLMs: Routing Mechanism, Expert Specialization, and Layerwise Steering [61.0787902713059]
We propose a routing-guided steering method that adaptively guides routing behavior in middle layers toward shared experts associated with dominant languages at inference time. Our code is available at http://conctsai.com/multilingualism-in-Mixture-of-Experts-LLMs.
arXiv Detail & Related papers (2026-01-20T15:04:25Z) - ECMF: Enhanced Cross-Modal Fusion for Multimodal Emotion Recognition in MER-SEMI Challenge [5.217410271468519]
We tackle the MER-SEMI challenge of the MER2025 competition by proposing a novel multimodal emotion recognition framework. We leverage large-scale pre-trained models to extract informative features from visual, audio, and textual modalities. Our approach achieves a substantial performance improvement over the official baseline on the MER2025-SEMI dataset.
arXiv Detail & Related papers (2025-08-08T03:55:25Z) - MUCAR: Benchmarking Multilingual Cross-Modal Ambiguity Resolution for Multimodal Large Language Models [19.241274582769037]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances across numerous vision-language tasks. We introduce MUCAR, a novel benchmark designed explicitly for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios.
arXiv Detail & Related papers (2025-06-20T14:57:41Z) - NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-Identification [17.10113184019939]
Multi-modal object Re-Identification (ReID) aims to obtain accurate identity features across heterogeneous modalities. In this paper, we propose a reliable caption generation pipeline based on attribute confidence. We also propose a novel ReID framework, named NEXT, to model diverse identity patterns.
arXiv Detail & Related papers (2025-05-26T13:52:28Z) - A-MESS: Anchor based Multimodal Embedding with Semantic Synchronization for Multimodal Intent Recognition [3.4568313440884837]
We present the Anchor-based Multimodal Embedding with Semantic Synchronization (A-MESS) framework. We first design an Anchor-based Multimodal Embedding (A-ME) module that employs an anchor-based embedding fusion mechanism to integrate multimodal inputs. We then develop a Semantic Synchronization (SS) strategy with a Triplet Contrastive Learning pipeline, which optimizes the process by synchronizing multimodal representations with label descriptions.
arXiv Detail & Related papers (2025-03-25T09:09:30Z) - Hierarchical Banzhaf Interaction for General Video-Language Representation Learning [60.44337740854767]
Multimodal representation learning plays an important role in the artificial intelligence domain. We introduce a new approach that models video-text as game players using multivariate cooperative game theory. We extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks.
arXiv Detail & Related papers (2024-12-30T14:09:15Z) - PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis [74.41260927676747]
This paper bridges the gaps by introducing multimodal conversational Aspect-based Sentiment Analysis (ABSA).
To benchmark the tasks, we construct PanoSent, a dataset annotated both manually and automatically, featuring high quality, large scale, multimodality, multilingualism, multi-scenarios, and covering both implicit and explicit sentiment elements.
To effectively address the tasks, we devise a novel Chain-of-Sentiment reasoning framework, together with a novel multimodal large language model (namely Sentica) and a paraphrase-based verification mechanism.
arXiv Detail & Related papers (2024-08-18T13:51:01Z) - Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z) - TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.